Abstract. An emerging trend in microprocessor design is to move complexity
from a machine's microarchitecture into its instruction-set architecture.
This trend will allow compilers to express inter-instruction dependency
information that current superscalar out-of-order machines, such as the
Pentium III, derive while performing computation. This trend will change
the nature of microprocessor verification: the microarchitectural models
will become simpler, but their specifications will become more subtle.

This paper explores the implications that this trend will have on
microprocessor verification. We develop an explicitly parallel instruction-set
architecture motivated by Intel's IA-64 and discuss possibilities for
microarchitectural implementations. We then explore correctness criteria
for relating microarchitectures to explicitly parallel instruction sets.


1  Introduction 

Historically, each generation of microprocessors has been more aggressive than
the previous generation in its search for and exploitation of instruction-level
parallelism [23]. For example, Intel's Pentium III (which is based on the P6
microarchitecture [6, 12]) maintains a graph of 40 instructions, from which it
analyzes inter-instruction dependencies and dynamically schedules instructions
into execution units.

There is a cost to this sophistication. Complex superscalar out-of-order
microarchitectures lead to larger, hotter microprocessors that consume more
power [8]. They are difficult to design and debug, and typically have long
critical paths, which inhibit faster clock speeds [5]. Some microarchitects feel
that the returns are diminishing from their continued investment into the
run-time discovery of instruction-level parallelism [25].

A  new  trend  is  developing.  Intel  [13,14],  Hewlett-Packard  [13,19],  Compaq 
[30],  Tera  [2],  Elbrus  [9]  and  others  are  all  extending  or  designing  new  instruction- 
set  architectures  with  constructs  for  explicit  parallelism.  These  features  include 
predication  [1],  speculative  load  instructions  [17],  and  annotations  that  describe 
the  dependencies  between  instructions  [28]. 

* This research is supported in part by Intel, the National Science Foundation, the
Defense Advanced Research Projects Agency, and Air Force Materiel Command.


What will these new instruction sets look like? How will we verify
microarchitectures against them? These are the questions that we hope to
address. In this paper, we construct a formal semantics for an instruction-set
architecture based on publicly available information regarding Intel's new
IA-64 [10]. We then develop a clustered microarchitectural design, and discuss
its correctness criteria.


2  OA-64:  an  explicitly  parallel  instruction  set 

This  section  introduces  and  motivates  the  emerging  style  of  architecture  design 
through  the  Oregon  Architecture  (OA-64)  —  an  explicitly  parallel  instruction 
set.  OA-64  extends  a  traditional  instruction  set  in  three  ways: 

Predication  allows  for  the  conditional  execution  of  instructions. 

Speculative  loads  are  instructions  that  can  be  issued  before  the  value  they 
produce  is  needed  without  risk  of  raising  an  exception. 

Parallelism  annotations  describe  the  dependencies  between  instructions. 

To see how these features fit into OA-64, consider Fig. 1, which contains
OA-64 code for the factorial function:

- An OA-64 program is a finite sequence of packets, where each packet consists
of three instructions. OA-64 programs are addressed at the packet level. That
is, instructions are fetched in packets, and branches can jump only to the
beginning of a packet.

- Instructions are annotated with thread identifiers. For example, the 0 in the
check_s instruction declares that instructions with thread identifiers that
are not equal to 0 can be executed in any order with respect to the check_s
instruction.

- Packets can be annotated with a fence directive (FENCE), which directs the
machine to retire all in-flight instructions before executing the following
packet.

- Instructions are predicated on boolean-valued registers. For example, the
check_s instruction will only be executed if the value of p5 is true in the
current register-file state.


2.1  Calculating  regions 

Thread  identifiers  and  fences  are  annotations  to  express  concurrency  information 
about  instructions.  One  useful  presentation  of  this  concurrency  information  is 
a  directed  graph  whose  nodes  are  sets  of  threads  (which  are  finite  instruction 
sequences)  that  occur  between  fence  directives.  We  call  each  set  of  threads  a 
region. The general idea is that an OA-64 machine will execute one region
at  a  time.  In  this  manner,  all  values  computed  in  previously  executed  regions 
are  available  to  all  threads  in  the  current  region. 
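
As an illustration only, the region calculation described above can be sketched in Haskell. The types Instr, Packet and Region and the function names below are hypothetical (they are not part of OA-64), and a packet is modeled as a list of instructions rather than exactly three. The sketch cuts the packet sequence after every fenced packet and then groups the instructions of each block by thread identifier, preserving program order within a thread.

```haskell
import qualified Data.Map.Strict as Map

-- Hypothetical encoding: an instruction is a thread identifier plus its text.
type ThreadId = Int
data Instr = Instr { threadId :: ThreadId, body :: String } deriving (Eq, Show)

-- A packet and its optional fence directive (assumed representation).
data Packet = Packet { instrs :: [Instr], fenced :: Bool } deriving (Eq, Show)

-- A region maps each thread identifier to its instruction sequence.
type Region = Map.Map ThreadId [Instr]

-- Cut the packet sequence after every fenced packet.
splitAtFences :: [Packet] -> [[Packet]]
splitAtFences [] = []
splitAtFences ps =
  case break fenced ps of
    (open, f : rest) -> (open ++ [f]) : splitAtFences rest
    (open, [])       -> [open]   -- trailing packets with no closing fence

-- Group the instructions of one fence-delimited block by thread identifier.
-- flip (++) appends later instructions after earlier ones (program order).
toRegion :: [Packet] -> Region
toRegion ps =
  Map.fromListWith (flip (++)) [ (threadId i, [i]) | p <- ps, i <- instrs p ]

regions :: [Packet] -> [Region]
regions = map toRegion . splitAtFences
```

Under these assumptions, two unfenced-then-fenced packets (such as packets 103 and 104 in the text) collapse into a single region, while a fenced packet that follows another fenced packet forms a region by itself.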
Fig. 1. OA-64 implementation of factorial function.


In  this  section  we  derive  the  meaning  of  the  code  in  Fig.  1  by  calculating  its 
regions.  We  assume  that  packet  100  issues  a  fence,  and  that  before  entering  this 
code,  the  machine  has  loaded  a  value  into  register  r2  with  the  speculative  load 
instruction  (load_s). 

In  packet  101,  the  check_s  instruction  declares  that  the  machine  is  about  to 
use  the  value  stored  into  r2.  It  is  at  this  point  that  the  machine  should  raise  any 
exceptions  that  might  have  been  encountered  while  speculatively  loading  data 
into r2. The first packet also initializes the values of registers r1 and r3. Because
r3 depends on the value of r2, the check_s instruction must be executed before
writing to r3; this is expressed by placing the same thread identifier (0) in the
two instructions. The calculation of r1, however, can be executed in any order
with respect to the 0 thread.

The fence directive in packet 101 instructs the machine to retire the active
threads before executing the following packet. Because both packets 100 and
101 issue fence directives, packet 101 forms its own region:


where boxes represent threads. Instructions within a thread must be executed in
order. Threads, however, can be executed in any interleaving order with other
threads. Packet 101 forms a region; therefore the machine is required to
synchronize the state before executing the next packet.

Because  packet  102  is  also  fenced,  it  also  forms  its  own  region: 


The  comparison  instruction  sets  the  predicate  register  p2  to  true  if  r2  is  not 
equal  to  0.  The  value  of  p3  is  set  to  the  negation  of  p2. 

Because  packet  103  is  not  fenced,  but  packet  104  is,  the  next  region  is  formed 
from  packets  103  and  104: 


This  region  contains  5  singleton  threads.  Note  that,  if  both  p2  and  p3  were 
true,  two  threads  would  write  to  the  program  counter  (pc)  in  an  arbitrary  order. 
However,  because  p2  and  p3  are  the  negation  of  one  another,  for  a  given  run  of 
the  region  only  one  thread  will  write  to  pc. 

Assignments to the program counter within a region are visible to the
machine's fetch mechanism only after a fence directive has been issued.
Therefore, a trace of an OA-64 program can be viewed as an infinite path
through the finite directed graph formed by regions and their successors:


At first glance, issuing speculative loads and calculating regions may appear
strange. However, this is precisely the sort of control calculation an
out-of-order machine performs while executing a traditional program [25]. For
example:

- An out-of-order execution core allows instructions following a memory load
to execute before retiring the load. The Pentium III temporarily stores
completed successors of a load into a content-addressable array until the load
is retired, and flushes the array if the load raises an exception.

The OA-64 code fragment in Fig. 1 uses a check_s instruction that checks
to see if the previously issued speculative load succeeded before executing
the instructions that depend on it.

- A traditional encoding of the factorial function would use a conditional
branch in the place of the predicate calculation. A machine with branch
speculation might predict that the branch is not taken and issue the
instructions in the third packet before calculating the condition. In this case
the branch target buffer is acting as the predicate register file.

The OA-64 program calculates a predicate, issues instructions from both
sides of the would-be branch, and in the end only commits the instructions
that satisfy the predicate.

- In a traditional instruction set the encoding of the factorial function would
leave much of the instruction-level parallelism implicit. The scheduling logic
within an out-of-order machine might analyze the register references and
discover that the subtract and multiply instructions are not dependent and
can be scheduled out-of-order.

In OA-64, the compiler (or programmer) declares the dependencies between
instructions. If the compiler expresses that the subtract and multiply
instructions are not dependent, the machine may retire them out-of-order.


3  Semantics  of  OA-64 

In this section we describe a formalization of OA-64 that facilitates the
mathematical verification of microarchitectural implementations. The meaning
of OA-64 is defined by a set of restrictions on the source program, an initial
state, and a transition relation that describes how instructions affect the
state.


3.1  Source  code  restrictions 

The  following  restrictions  are  placed  on  OA-64  programs: 

-  a  multiple  packet  region  must  always  execute  at  least  one  branch  instruction; 

-  a  branch  instruction  can  only  jump  to  a  packet  that  immediately  follows  a 
packet  with  a  fence  directive,  or  to  the  first  packet  in  the  program; 

- a program must be a finite sequence of packets.


3.2  Initial  state  and  transition  relation 

We  view  OA-64  as  a  two-level  language  —  the  bottom  level,  or  base-level ,  is  a 
vanilla  RISC  instruction  set  with  support  for  speculative  loads;  the  top  level, 
or  concurrency-level ,  handles  predication,  thread  identifier  annotations  and  the 
fence  directives.  The  concurrency-level  language  is  used  to  express  dependencies 
between  base-level  instructions. 

The semantics of OA-64 highlight this perspective by defining a traditional
base-level transition relation and a concurrency-level transition relation. The
base-level relation $\rhd$ is defined over instructions and pairs of base-level
architectural states, called base-states, which represent the state of the
register file and memory (the program counter is modeled as the special
register pc in the register file). The expression:

$$\Delta, w \rhd \Gamma$$

asserts that instruction $w$ in state $\Delta$ can execute and result in state
$\Gamma$ in $\rhd$. This relation is simply the familiar instruction-set style
of relation used in many papers, e.g.

$$\Delta, (x \leftarrow y + z) \rhd \Delta[x \mapsto \Delta(y) + \Delta(z)]$$

The concurrency-level transition relation $\blacktriangleright$ is defined in
Fig. 2 over pairs of concurrency-level architectural states, called
concurrency-states, which have the form:

$$(P, \Delta, \xi)$$

where $P$ is a finite sequence of packets representing the OA-64 program, and
$\Delta$ is a base-level state. $\xi$ is the state of the region, which is a
finite set of finite instruction sequences. Given an OA-64 program, $P$, the
machine's initial concurrency-state is the triple:

$$(P, \mathit{init}, \emptyset)$$

where init is an initialized base-state, and $\emptyset$ is the empty region.

In the initial concurrency-state, or when the machine has completely executed
a region, the concurrency-state of the machine will be in the following form

$$(P, \Delta, \emptyset)$$

In this case, the rule next (in Fig. 2) states that the machine should use the
value of pc in the current base-state ($\Delta$) to fetch the next region. The
function region, when given an OA-64 program and an index, returns the region
pointed to by the index. Also, the base-state is updated by incrementing the
program counter.

If the region in the current concurrency-state is not empty, then it will have
the form

$$(P, \Delta, (\ldots, (w\ \mathrm{if}\ p) : ws, \ldots))$$

where (w if p) is the first instruction of an arbitrarily chosen thread¹. If, in the
base-state $\Delta$, the value of p is false then the instruction w is thrown away (rule

1  We  use  :  as  a  constructor  of  lists.  In  the  expression  x  :  xs,  x  is  the  first  element  in 
the  list  and  xs  represents  the  remaining  elements 


skip in Fig. 2). Otherwise, if p is true in the base-state, then a new base-state
$\Gamma$ is related to $\Delta$ and w by $\rhd$ (rule execute).


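$$\text{(next)}\quad (P, \Delta, \emptyset) \blacktriangleright (P, \Delta[pc \mapsto pc + 1], \mathit{region}(P, \Delta(pc)))$$

$$\text{(skip)}\quad (P, \Delta, (\ldots, (w\ \mathrm{if}\ p) : ws, \ldots)) \blacktriangleright (P, \Delta, (\ldots, ws, \ldots)) \quad \text{if } \neg\Delta(p)$$

$$\text{(execute)}\quad (P, \Delta, (\ldots, (w\ \mathrm{if}\ p) : ws, \ldots)) \blacktriangleright (P, \Gamma, (\ldots, ws, \ldots)) \quad \text{if } \Delta(p) \wedge \Delta, w \rhd \Gamma$$


Fig. 2. Concurrency-level semantics of OA-64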
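
To make the three rules concrete, here is a minimal executable sketch in Haskell. It rests on several assumptions that are not in the semantics above: the base-state holds registers only (memory is omitted), the base-level relation is deterministic and so is written as a function, a region is a list of threads rather than a set, a region whose threads are all exhausted is treated as empty, and the nondeterministic choice of thread is supplied by a hypothetical `pick` argument. The instruction subset and all names are illustrative.

```haskell
import qualified Data.Map.Strict as Map

type Reg  = String
type Base = Map.Map Reg Int          -- base-state: register file only (assumption)

-- A tiny hypothetical base-level instruction subset.
data Instr = Add Reg Reg Reg         -- x <- y + z
           | Set Reg Int             -- x <- n
           deriving (Eq, Show)

-- A predicated instruction "w if p"; a predicate register is true when nonzero.
data Pred = Pred { guardReg :: Reg, instr :: Instr } deriving (Eq, Show)

type Thread = [Pred]
type Region = [Thread]

look :: Base -> Reg -> Int
look s r = Map.findWithDefault 0 r s

-- Base-level step, written as a function because this subset is deterministic.
baseStep :: Base -> Instr -> Base
baseStep s (Add x y z) = Map.insert x (look s y + look s z) s
baseStep s (Set x n)   = Map.insert x n s

-- One concurrency-level step. `fetch` stands in for region(P, pc) and
-- `pick` selects which live thread moves next.
step :: (Int -> Region) -> Int -> (Base, Region) -> (Base, Region)
step fetch pick (s, region)
  | null live = (Map.insert "pc" (pc + 1) s, fetch pc)           -- rule next
  | otherwise = case splitAt (pick `mod` length live) live of
      (before, (Pred p w : ws) : after)
        | look s p /= 0 -> (baseStep s w, before ++ ws : after)   -- rule execute
        | otherwise     -> (s,            before ++ ws : after)   -- rule skip
      _ -> (s, live)                                              -- unreachable
  where
    live = filter (not . null) region   -- exhausted threads are dropped
    pc   = look s "pc"
```

Iterating `step` with different `pick` values explores different interleavings of the same region, which is the freedom the concurrency-level relation grants an implementation.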


4  Columbia:  An  OA-64  microarchitecture 

The  advantage  of  OA-64  is  that  the  microarchitecture  can  dedicate  more  of  its 
resources  to  computation,  and  less  to  scheduling.  This  section  presents  an  outline 
of  a  possible  microarchitecture  for  OA-64. 

The  picture  in  Fig.  3  is  of  Columbia,  a  clustered  OA-64  microarchitecture. 
The  machine’s  execution  core  is  composed  of  three  independent  execution  pipelines, 
or  clusters.  At  each  cycle  Columbia  fetches  a  packet  from  the  ICache  and  feeds 
it  to  the  clusters.  In  the  case  that  a  packet  contains  a  fence  directive,  the  machine 
stops  fetching  instructions  until  all  of  the  clusters  have  been  flushed. 

Fetched instructions travel from the ICache to the Route unit, where they
are routed to one of three pipelined execution clusters based on their thread
identifier (modulo 3). The execution clusters act as traditional in-order
pipelined execution cores, except that they share a communal register file (RF)
and data cache (MCache). At each clock cycle the clusters calculate how many
instructions they can accept on the next cycle. The minimum of these values is
sent to the control logic (because all of the instructions in a packet might be
routed to one execution cluster). The control logic is also signaled when all of
the clusters are in a flushed state.
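
The routing and back-pressure behaviour just described can be sketched as follows. This is an illustration under assumptions: the Instr type and the names route and canFetch are hypothetical, and the rule that fetch proceeds only when the minimum free capacity covers a whole packet is one plausible reading of how the control logic uses the minimum, not something stated above.

```haskell
type ThreadId = Int
data Instr = Instr { threadId :: ThreadId, text :: String } deriving (Eq, Show)

-- Route each instruction of a packet to cluster (thread identifier mod n).
-- The result has one list per cluster, in cluster order.
route :: Int -> [Instr] -> [[Instr]]
route n packet =
  [ [ i | i <- packet, threadId i `mod` n == c ] | c <- [0 .. n - 1] ]

-- Assumed control rule: fetch the next packet only if the cluster with the
-- least free capacity could still accept every instruction of a packet.
-- freeSlots must be non-empty (one entry per cluster).
canFetch :: Int -> [Int] -> Bool
canFetch packetSize freeSlots = minimum freeSlots >= packetSize
```

Because routing depends only on the thread identifier, all instructions of one thread land in the same in-order cluster, which is what preserves the per-thread ordering required by the instruction set.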

The  fetch  logic  uses  the  register  file’s  program  counter  value.  The  Valid 
circuit  determines,  based  on  whether  or  not  the  machine  is  still  servicing  a  fence 
directive,  if  the  program  counter  should  be  used  (i.e.  the  machine  has  finished 
processing  a  region). 

Notice that, in contrast to the large amounts of interconnected state found
in superscalar out-of-order models, Columbia's state is smaller and mostly local
(i.e., local buffers within execution clusters). This is good news for everyone:
the reduced state will be simpler for algorithmic formal verification, and the
reduced interaction between components will be good for deduction.


Fig.  3.  Columbia  microarchitecture  —  pictured  here  with  three  pipeline  clusters 


5  Verification 


Explicitly  parallel  machines  aim  to  exploit  much  of  the  same  instruction-level 
parallelism  that  superscalar  out-of-order  machines  use  —  with  a  twist.  They  use 
less  hardware,  but  are  more  difficult  to  program.  It  is  therefore  natural  that 
the  verification  of  explicitly  parallel  microarchitectures  will  be  similar  to  the 
verification  of  superscalar  out-of-order  machines  —  with  a  twist.  They  will  be 
easier  to  prove  correct,  but  the  correctness  criteria  are  more  difficult  to  define. 

Assume that, for a given microarchitectural model, $\xi_n$ is a projection
representing the machine's region state at time $n$, and $\delta_n$ is the
base-state within the microarchitecture. In the case of Columbia, $\xi$ is the
contents of the pipelines (and their buffers) and $\delta$ equals the contents
of the register file and memory cache.

The criteria that we advocate for OA-64 are that, for a given program ($P$), the
concurrency-state formed with $\xi$ and $\delta$ should infinitely often enter a
reachable concurrency-state defined by the closure of the instruction-set
semantics (safety)


$$\forall n.\ \exists n'.\ n < n' \wedge (P, \mathit{zero}, \emptyset) \blacktriangleright^{*} (P, \delta_{n'}, \xi_{n'})$$


and that the machine infinitely often makes progress in the computation
(liveness)


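$$\forall n.\ (P, \mathit{zero}, \emptyset) \blacktriangleright^{*} (P, \delta_n, \xi_n) \implies \exists n'.\ n < n' \wedge (P, \delta_n, \xi_n) \blacktriangleright^{+} (P, \delta_{n'}, \xi_{n'})$$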


The key here is regions, which declare the existence of synchronization points:
concurrency-states along the path of execution in which threads have access
to the results of computation from previously executed threads. In ►, every
concurrency-state resulting from a next transition is a synchronization point.
The formulation of OA-64, coupled with the constraints on the input program,
guarantees that regions are always finite. Therefore OA-64 guarantees that the
transition next will be applied infinitely often.

Suppose  that,  for  a  given  program,  the  concurrency-state  transition  graph 
(based  on  the  region  element  of  the  concurrency-state)  has  the  following  form 


where  the  black  circles  are  the  synchronization  points.  Also,  suppose  that  the 
microarchitectural  transition  graph  (based  on  the  value  of  the  microarchitectural 
thread  state  £)  has  the  form 


where the black circles represent microarchitectural synchronization points.
Between synchronization points the microarchitecture might make more or fewer
transitions than the instruction-set architecture. However, when viewing
synchronization points, the microarchitecture's transitions are contained by the
architecture. The verification problem is then to demonstrate that, when the
microarchitecture has reached a synchronization point, the state of the register
file and the region that it is executing relates to a reachable concurrency-state
in OA-64.




5.1  Adapting  pipeline  flushing  methods 

When  paired  with  an  inductive  proof  over  the  infinite  path  of  regions,  the  pipeline 
flushing  method  [4]  for  pipeline  verification  can  be  adapted  to  imply  the  proposed 
safety  property. 

In Burch and Dill's formulation, one must prove the commuting square for
all possible instructions I:


In the setting of explicitly parallel architectures, we propose letting I range
over regions instead of instructions. That is, assume that the microarchitecture
begins to execute a region in synchronization point s1, and that s2 is the next
synchronization point resulting from the execution starting at s1. Let s1' be the
result of flushing and projecting out the architectural state from s1, and s2' be
the analogous calculation from s2. Does there exist a path in ► from s1' to s2'?

A drawback to this formulation is that I no longer has a clear bound (e.g., 16-bit
instructions). Instead, I is bounded by the size of regions, which is not
satisfactory for model checking. In our verification, we made deductive arguments
based on the fact that some finite number of cycles after fetching a packet with
a fence declaration, Columbia transitions into a synchronization point. We used
a symmetry-reduction styled argument to show that, if the microarchitecture
fetched an entire region before executing (given sufficient buffering), then that is
the same as concurrently executing and fetching that region. The more abstract
transition relation calculated from this symmetry argument was then compared
to ►. The final step was to show that, when the machine has entered into a
synchronization point, it correctly transitions to the next region. This final
step was proved using Symbolic Trajectory Evaluation [15].

A useful characteristic of Columbia-like microarchitectural models is that the
number and arrangement of clusters doesn't affect the correctness of a
microarchitectural design. This is because the transition relation ► allows for
any order of evaluation when many threads are trying to write to a shared
location in the object state.

No matter how many clusters the execution core employs, so long as the
clusters behave analogously to $\rhd$, the correctness of the execution core
outlined in Fig. 3 can be abstractly characterized by the following assertion
(certain preconditions have been omitted):

$$(P, \delta_n, \mathit{schedule}(\mathit{fetched}_n, \xi_n)) \blacktriangleright^{*}_{\{execute,\, skip\}} (P, \delta_{n+1}, \xi_{n+1})$$

where $\blacktriangleright^{*}_{\{execute,\, skip\}}$ is the closure of the relation ► using only the
rules execute and skip, and schedule distributes a packet into a partial region.

Note to reviewers: We're waving our hands a bit in this section. The
statements made in this section are based on a pencil-and-paper proof. We are
building a proof in Isabelle which should be done before a camera-ready version
of this paper would be due.


6  Related  work 

The  work  in  this  paper  is  closely  aligned  in  approach  with  the  existing  research 
on  the  verification  of  superscalar  out-of-order  machines  [3,7,11,24,26,27],  all 
of  which  use  refinement  based  techniques  or  flushing  (which  can  be  cast  as  an 
instance  of  refinement).  In  most  of  these  papers,  extra  information  about  the 
dependencies,  which  OA-64  makes  explicit,  has  been  added  to  the  models.  For 
example,  Damm  and  Pnueli  construct  a  non-deterministic  data-flow  machine 
that  computes  the  same  result  as  the  instruction-set  architecture  and  is  refined 
by  a  Tomasulo-like  transition  system.  Of  course  their  machines  can  only  execute 
finite instruction streams that do not contain branches, but their abstract
data-flow machine is similar to OA-64.

The  instruction  set  of  the  Java  virtual  machine  includes  facilities  for  threaded 
execution. Unfortunately, the formalizations of the Java virtual machine have, to
date,  concentrated  mainly  on  type-safety  ([22],  for  example)  or  have  assumed  a 
single-threaded  semantics  (such  as  [29]). 

Techniques  from  formal  verification  have  been  used  to  automate  the  test 
generation  for  a  dual-issue  DLX  microprocessor  [16]  which  can  be  viewed  as 
a  simple  explicitly  parallel  machine.  The  Stanford  Validity  Checker  has  been 
used  to  show  properties  of  this  same  processor  [18].  However,  that  paper  focuses 
primarily  on  the  quantifier-free  logic  of  equality  with  uninterpreted  functions 
and  does  not  go  into  detail  about  the  properties  verified. 

7  Future  work 

The upcoming explicitly parallel instruction-set architectures will take many
forms; OA-64 is only one conservative possibility. For example, the real
instruction sets might allow synchronization between individual threads, or they
might allow branch instructions to take immediate effect on the machine's
program counter. Meanwhile, real explicitly parallel microarchitectures will use
branch prediction, or even out-of-order clusters, to improve performance. The
work presented here is conservative in its specification and model. We hope to
verify more sophisticated microarchitectures against more realistic instruction
sets.

The use of layered transition relations (► and ▷) has been invaluable to
the understanding and verification of explicitly parallel machines. We hope to
generalize this notion, with separate levels for each instruction-set feature:
concurrency, predication, speculation, etc. We may find that a particular piece
of a microarchitecture implements a single level of an instruction set's
semantics, which might allow us to treat the other semantic layers much more
abstractly, perhaps as uninterpreted functions.

Letting  the  I  range  over  entire  regions  in  Section  5.1,  while  theoretically 
interesting,  makes  algorithmic  verification  difficult.  We  hope  to  find  other  finer- 
grained  approaches  (perhaps  still  based  on  flushing)  that  imply  correctness. 

McMillan’s  use  of  symmetry  [21]  might  prove  to  be  useful  in  the  setting 
of  multiple  symmetric  execution  clusters.  It  should  be  possible  that  a  small 
set  of  cluster  configurations  could  represent  all  possible  cluster  configurations. 
McMillan  applied  this  technique  to  reduce  the  number  of  reservation  station 
and execution unit pairs in his model of Tomasulo's algorithm. He represented
all  configurations  with  two  reservation  station  /  execution  unit  pairs  —  one  to 
represent  the  active  pair,  and  the  other  to  represent  all  other  pairs. 

From  ► ,  it  might  be  interesting  to  develop  a  reference  model  and  verify  more 
sophisticated  OA-64  models  against  it  using  the  algebraic  approach  proposed 
by  Matthews  and  Launchbury  [20].  This  will  involve  developing  (perhaps  non- 
finite  state)  circuits  that  model  the  characteristics  of  the  instruction  set  such  as 
predicated  execution,  speculative  execution,  etc,  and  then  using  algebraic  laws 
to  transform  the  microarchitectural  model  into  a  reference  machine. 
Abstract. We present a technique for doing symbolic simulation of
microprocessor models in the functional programming language Haskell. We
use polymorphism and the type class system, a unique feature of Haskell,
to write models that work over both concrete and symbolic data. We offer
this approach as an alternative to the technique of uninterpreted
constants. Compared with previously reported symbolic simulation efforts
in theorem provers, the performance of our approach compares favorably,
and indeed is several times faster. We illustrate our work with both
a simple state-based example and a complex, superscalar, out-of-order,
stream-based microprocessor model.


1  Introduction 

Symbolic simulation is becoming an important technique for verification of
circuits. It can be used by itself for validation of microcode [Gre98] and it is
a key ingredient to verification techniques such as symbolic trajectory
evaluation [SB95] and Burch-Dill style microprocessor verification
[BD94,JDB95]. Symbolic simulation executes a model for multiple data values in
a single simulation run. For example, a symbolic program that we discuss in this
paper takes the input data x and calculates x^4 (or x * x * x * x).

Symbolic simulation of microprocessor models written in the Haskell
programming language [PH97] is possible without extending the language or its
compilers and interpreters. When symbolically simulating a simple
microprocessor model, we achieved performance of approximately 58,300
instructions per second. We describe how Haskell's type class system allows a
symbolic domain to be substituted for a concrete one without changing the model
or explicitly passing the operations on the domain as parameters. Algebraic
manipulations of values in the symbolic domain carry out simplifications similar
to what is accomplished by rewriting in theorem provers to reduce the size of
terms in the output.
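
The idea can be sketched in a few lines of Haskell. The symbolic domain Sym, its constructors, and the function fourth below are hypothetical illustrations rather than the definitions used in this paper, and the simplification rules are a deliberately small sample. The point is only that a model written against the standard Num class runs unchanged over integers and over symbolic terms.

```haskell
-- A hypothetical symbolic domain: constants, variables, sums and products.
data Sym = Const Integer | Var String | Sym :+: Sym | Sym :*: Sym
  deriving (Eq, Show)

-- Overloaded arithmetic with a few algebraic simplifications built in.
instance Num Sym where
  Const 0 + e       = e
  e + Const 0       = e
  Const a + Const b = Const (a + b)
  a + b             = a :+: b
  Const 1 * e       = e
  e * Const 1       = e
  Const 0 * _       = Const 0
  _ * Const 0       = Const 0
  Const a * Const b = Const (a * b)
  a * b             = a :*: b
  negate e          = Const (-1) * e
  fromInteger       = Const
  abs    _          = error "abs: not modelled in this sketch"
  signum _          = error "signum: not modelled in this sketch"

-- The model is written once against Num, so the same code computes x^4
-- for concrete and for symbolic x.
fourth :: Num a => a -> a
fourth x = sq * sq
  where sq = x * x
```

Applying fourth to an Integer yields a number, while applying it to `Var "x"` yields a term built from `:*:`; no operations are passed explicitly, since the type class instance is selected by type inference.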

The infrastructure required for using symbolic values and maintaining a
symbolic state set is reusable for simulation of different models. We believe the
approach  presented  in  this  paper  may  be  applied  in  other  languages  with  user- 
defined  data  types,  polymorphism,  and  overloading.  However,  a  key  requirement 
is that overloading work over polymorphic types. Few programming languages
support  this,  although  a  different  approach  using  parameterized  modules,  as  in 
SML,  might  also  work  well.  Haskell’s  elegant  integration  of  overloading  with  type 
inference  and  the  clear  semantics  of  the  language  make  it  amenable  to  formal 
verification. 


2  Example 

To illustrate our technique, we use the simple, non-pipelined, state-based
processor model given in Moore's paper on symbolic simulation [Moo98]. First,
we explain the model and demonstrate concrete simulation. Next, we show how
using more general types for the data in the model makes it possible to simulate
interchangeably concrete and symbolic values. The full source code for this
example in Haskell can be found at
http://www.cse.ogi.edu/~nday/Papers/sym_sim.html.


2.1  Model 

The  opcodes  of  the  simple  machine  are  described  using  a  data  type: 

data Op = MOVE Addr Addr
        | MOVI Addr Data
        | ADD Addr Addr
        | SUBI Addr Data
        | JUMPZ Addr Loc
        | JUMP Loc
        | CALL String
        | RET

For  now,  interpret  the  type  names  Addr  (memory  address),  Loc  (location),  and 
Data  as  integers. 

The  machine’s  visible  state  is  captured  by  five  values:  the  program  counter, 
the  stack  pointer,  the  data  memory  (modeled  as  a  list,  and  indexed  by  integers), 
the  halt  signal,  and  the  program.  The  program  is  indexed  by  a  name  and  location 
because  separate  routines  are  stored  in  distinct  memory.  Thus,  the  program 
counter  and  elements  of  the  stack  consist  of  both  a  name  and  a  location.  The 
program  consists  of  names  with  associated  lists  of  instructions.  The  machine’s 
state  is  captured  using  the  following  data  type1: 

data MachState = ST ((String,Loc), [(String,Loc)], [Data], Bool, Program)

The  meaning  of  each  instruction  is  described  by  individual  functions  that 
take  a  machine  state  and  return  a  machine  state,  such  as: 

add a b (ST ((name,loc), stk, mem, halt, code)) =
    mkState ((name,loc+1), stk,
             put a (mem `at` a + mem `at` b) mem, halt, code)

subi a b (ST ((name,loc), stk, mem, halt, code)) =
    mkState ((name,loc+1), stk, put a ((mem `at` a) - b) mem, halt, code)

jumpz a b (ST ((name,loc), stk, mem, halt, code)) =
    if' ((mem `at` a) === 0)
        (mkState ((name,b), stk, mem, halt, code))
        (mkState ((name,loc+1), stk, mem, halt, code))


1 In Haskell, list types are represented using square brackets (“[ ... ]”).


The semantics of the ADD instruction increase the program counter by one and
store the sum of the values in memory locations a and b in memory location a.
The function at is an indexing function. In Haskell, a regular identifier is
used as an infix operator by surrounding it with backquotes, as we did above.
The SUBI instruction subtracts the immediate value b from the value in memory
location a. The JUMPZ instruction sets the program counter to the value b if
memory location a holds the value 0. The operator === is defined to be equality
over integers and if' is if-then-else. The function mkState turns a tuple into
a state.

The function execute matches opcodes to the semantic functions. For example,
execute calls the semantic function for ADD as follows, where s is a state:

execute (ADD a b) s = add a b s


2.2  Concrete  simulation 


We can execute the model on particular concrete programs. One of the example
programs given in Moore’s paper multiplies the value in mem[0] by mem[1] using
repeated addition, leaving the result in mem[2] and clearing mem[0]:

prog = [ MOVI 2 0,     -- 0, mem[2] <- 0
         JUMPZ 0 5,    -- 1, if mem[0]=0 goto 5
         ADD 2 1,      -- 2, mem[2] <- mem[1] + mem[2]
         SUBI 0 1,     -- 3, mem[0] <- mem[0] - 1
         JUMP 1,       -- 4, goto 1
         RET ]         -- 5, return to caller

Comments (prefixed by --) on the right describe the meaning of each
instruction. Beginning with memory containing the values [7,11,3,4,5] (i.e.,
mem[0] containing 7 and mem[1] containing 11), and executing the machine for 31
cycles, results in the following memory state: [0,11,77,4,5]. Memory location 2
contains the result of multiplying 7 by 11.
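
The concrete run described above can be reproduced with a reduced version of the machine. The following is only a sketch under stated assumptions, not the paper's full model: it drops MOVE, CALL, the call stack, and the named routines, keeps just a program counter, a memory list, and a halt flag, and assumes that a top-level RET sets the halt flag. The names step and run are introduced here for illustration.

```haskell
-- Simplified sketch of the concrete machine: no MOVE, no CALL, no call stack.
type Addr = Int
type Loc  = Int
type Data = Int

data Op = MOVI Addr Data | ADD Addr Addr | SUBI Addr Data
        | JUMPZ Addr Loc | JUMP Loc | RET

-- Program counter, data memory, halt flag (an assumed, reduced state).
data MachState = ST Loc [Data] Bool
  deriving (Eq, Show)

-- Replace the value stored at address a.
put :: Addr -> Data -> [Data] -> [Data]
put a v mem = take a mem ++ [v] ++ drop (a + 1) mem

execute :: Op -> MachState -> MachState
execute (MOVI a v)  (ST pc mem h) = ST (pc + 1) (put a v mem) h
execute (ADD a b)   (ST pc mem h) = ST (pc + 1) (put a (mem !! a + mem !! b) mem) h
execute (SUBI a v)  (ST pc mem h) = ST (pc + 1) (put a (mem !! a - v) mem) h
execute (JUMPZ a l) (ST pc mem h)
  | mem !! a == 0 = ST l mem h
  | otherwise     = ST (pc + 1) mem h
execute (JUMP l)    (ST _ mem h)  = ST l mem h
execute RET         (ST pc mem _) = ST pc mem True  -- assumed: top-level RET halts

-- One machine cycle; a halted machine stays where it is.
step :: [Op] -> MachState -> MachState
step code s@(ST pc _ halted)
  | halted    = s
  | otherwise = execute (code !! pc) s

-- Run the machine for n cycles.
run :: Int -> [Op] -> MachState -> MachState
run n code s = iterate (step code) s !! n

prog :: [Op]
prog = [MOVI 2 0, JUMPZ 0 5, ADD 2 1, SUBI 0 1, JUMP 1, RET]
```

Under these assumptions, run 31 prog (ST 0 [7,11,3,4,5] False) evaluates to ST 5 [0,11,77,4,5] True: one MOVI, seven passes of four instructions each, the final JUMPZ, and the RET account for the 31 cycles.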


2.3  Overloading:  Type  classes 

We  now  use  the  type  class  system  of  Haskell  to  make  all  the  operations  that 
manipulate  data  be  overloaded  on  both  concrete  and  symbolic  data. 

In the previous concrete simulation, the type of the function subi is2:

subi :: Addr -> Data -> MachState -> MachState

2 In Haskell, a type expression is preceded by “::”.


To simulate symbolic values, we allow the type Data to be interpreted at types
other than Int. We cannot allow Data to be any type (i.e., make subi
polymorphic) because numeric operations are not defined for all types.
Alternatively, we could parameterize subi by the numeric and other operations
that are specific to the type of data in memory (i.e., symbolic or concrete).
This is the approach of Joyce-style representation variables [Joy90], where all
the semantic functions are parameterized by what could become a long list of
the type-specific operations needed by any opcode.
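
To make the cost of explicit parameterization concrete, the following sketch passes the type-specific operations as an ordinary record argument. It is written in the spirit of representation variables rather than being taken from [Joy90]; the record Ops, its fields, and the functions subiMem and addMem are hypothetical names that act on a bare memory list instead of the full machine state.

```haskell
-- Hypothetical record bundling the type-specific operations on data values.
data Ops a = Ops
  { plus  :: a -> a -> a
  , minus :: a -> a -> a
  , lit   :: Int -> a      -- embeds integer constants into the data domain
  }

-- The bundle for ordinary integers.
intOps :: Ops Int
intOps = Ops { plus = (+), minus = (-), lit = id }

-- Replace the value stored at index i.
put :: Int -> a -> [a] -> [a]
put i v mem = take i mem ++ [v] ++ drop (i + 1) mem

-- Every semantic function must accept the bundle as an explicit argument.
subiMem :: Ops a -> Int -> a -> [a] -> [a]
subiMem ops a b mem = put a (minus ops (mem !! a) b) mem

addMem :: Ops a -> Int -> Int -> [a] -> [a]
addMem ops a b mem = put a (plus ops (mem !! a) (mem !! b)) mem
```

Each new operation needed by some opcode adds a field to the record, and the record has to be threaded through every function that touches data.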

Our  solution  is  to  take  advantage  of  the  overloading  of  operators  provided 
by  type  classes.  A  type  class  groups  a  set  of  operations  by  the  type  they  operate 
over.  The  typechecker  is  able  to  determine  which  instance  of  the  operation  is 
being  invoked  based  on  the  type  of  its  arguments. 

The  existing  Haskell  type  class  Num  has  almost  all  the  operators  that  we  re¬ 
quire  for  data  values  for  this  example.  In  Haskell,  a  type  class  definition  declares 
the  name  of  the  class  and  the  operations  on  members  of  the  class.  The  Num  class 
has  the  following  definition: 

class Num a where
    (+)     :: a -> a -> a
    (-)     :: a -> a -> a
    (*)     :: a -> a -> a
    fromInt :: Int -> a

Following the first line in this class definition are operators defined on
types within this class. Parentheses indicate that the operation is infix. The
parameter after the name of the class (a) is used to represent a type belonging
to this class. The type signatures of the operations are described in terms of
this type. The simple machine only requires the use of “+” and “-”. The
function fromInt turns integers into values of type a. This capability is very
useful when moving to the symbolic domain because it means existing uses of
constant integers do not have to be converted by hand into their representation
in the symbolic domain; fromInt is automatically applied to them.

In  Haskell,  the  type  Int  is  declared  to  be  an  instance  of  the  Num  class. 

For  the  JUMPZ  opcode,  the  equality  operation  on  data  values  is  also  needed. 
Therefore,  we  create  a  new  class  called  Word  that  inherits  all  the  operations  of 
Num  and  includes  the  operation  ===. 

class Num a => Word a where
    (===) :: a -> a -> a

The use of the operator => in Haskell indicates that the type a must be a
member of the type class Num and therefore the types in Word inherit all of
Num’s operations. The type Int is an instance of the type class Word where the
equals operator returns true (1) if the two operand integers are equal and
false (0) otherwise. Boolean values are treated as integers.

The  type  of  values  in  memory  now  must  be  elements  of  the  type  class  Word. 
The  types  MachState  and  Program  are  parameterized  by  the  type  of  the  memory 
elements,  as  in: 


data MachState a = ST ((String,Loc), [(String,Loc)], [a], Bool, Program a)

Opcodes are also adjusted to take immediate values of types in the Word class
rather than just integers. For example, the type of the subi instruction becomes:

subi :: Word a => Addr -> a -> MachState a -> MachState a

The  definition  of  subi  does  not  change. 

Concrete  simulation  of  prog  results  in  the  same  state. 
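
The overloading described in this section can be tried in isolation. The sketch below assumes a current GHC, where the Prelude already exports a type named Word (so it is hidden here) and where the standard Num class provides fromInteger rather than fromInt. The functions subiMem and isZeroAt are hypothetical helpers that act on a bare memory list instead of the full machine state.

```haskell
import Prelude hiding (Word)  -- the Prelude exports an unrelated type named Word

-- Data values: everything in Num, plus an equality that returns a data value.
class Num a => Word a where
  (===) :: a -> a -> a

-- Concrete instance: Booleans are encoded as the integers 1 and 0.
instance Word Int where
  x === y = if x == y then 1 else 0

-- Replace the value stored at index i.
put :: Int -> a -> [a] -> [a]
put i v mem = take i mem ++ [v] ++ drop (i + 1) mem

-- The body mentions only overloaded operations, so it works for any Word type.
subiMem :: Word a => Int -> a -> [a] -> [a]
subiMem a b mem = put a ((mem !! a) - b) mem

-- The literal 0 is converted to type a by the Num class.
isZeroAt :: Word a => Int -> [a] -> a
isZeroAt a mem = (mem !! a) === 0
```

No dictionary of operations appears in the argument lists; the typechecker selects the instance from the element type of the memory.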


2.4  Symbolic  simulation  of  data  flow 

Once  the  model  has  been  set  up  to  accept  memory  values  of  types  within  the 
Word  class  rather  than  just  integers,  we  can  consider  an  appropriate  symbolic 
domain.  Our  symbolic  domain  must  include  representations  of  all  operations  that 
the  model  performs  on  integers.  The  values  of  this  domain  represent  syntactic 
versions  of  the  expressions  performed  by  the  machine.  An  appropriate  symbolic 
domain  for  this  example  includes  representations  for  constants  (Const),  symbols 
(Var),  and  the  results  of  addition  and  subtraction  operations.  Using  a  recursive 
data  type,  we  describe  the  values  in  the  symbolic  domain  as: 

data Symbo =
      Const Int
    | Var String
    | Plus Symbo Symbo
    | Minus Symbo Symbo
    | Times Symbo Symbo

Plus  and  Minus  will  be  used  to  represent  the  results  of  addition  and  sub¬ 
traction  operations  on  numbers.  We  include  a  representation  of  multiplication 
(Times)  because  using  algebraic  laws  we  can  simplify  expressions  involving  ad¬ 
dition  and  subtraction  to  expressions  involving  multiplication  (Section  2.5). 

Next,  we  create  an  instance  of  the  Num  and  Word  type  classes  providing  wit¬ 
nesses  showing  how  the  required  operations  of  Num  and  Word  are  implemented 
for  Symbo.  Fig.  1  shows  the  instance  declarations  for  Symbo  that  include  func¬ 
tion  definitions  (using  pattern-matching)  for  these  operations.  The  last  case  in 
the  pattern-matching  is  the  default  case.  We  assume  for  the  moment  that  the 
operands  to  the  equality  operation  will  only  be  concrete  values. 

After  providing  these  instance  declarations,  all  that  is  necessary  to  simulate 
symbolically  the  program  prog  is  to  provide  symbolic  inputs.  To  calculate  7  *  j, 
we  begin  with  memory  having  the  values, 

[7, Var "j", Var "x", Var "y", Var "z"]

The result of the program after 31 steps is3:


[0, j, j + j + j + j + j + j + j, y, z]


3  This  output  is  pretty  printed  to  remove  the  “Var”  and  “Const”  prefixes. 


instance Num Symbo where
    Const x + Const y = Const (x + y)
    Const 0 + y       = y
    x + Const 0       = x
    x + y             = x `Plus` y

    Const x - Const y = Const (x - y)
    x - Const 0       = x
    x - y             = x `Minus` y

    Const x * Const y = Const (x * y)
    Const 0 * y       = Const 0
    x * Const 0       = Const 0
    Const 1 * y       = y
    x * Const 1       = x
    x * y             = x `Times` y

    fromInt           = Const . fromInt

instance Word Symbo where
    (Const x) === (Const y) = if (x == y) then (Const 1) else (Const 0)

Fig. 1. Instance declarations for “Symbo”

This result shows that the sequence of opcodes in the program performs repeated
addition, leaving the sum of seven copies of j in memory position 2.

In this example, we only made one input symbolic. If we had made all of memory
symbolic, we would not have been able to execute the program because the value
in memory location 0 is used to determine if a branch is taken. Because we have
not yet defined equality on symbolic values, checking whether a value like
Var "i" is 0 would cause a run-time error. We extend our example with symbolic
branching in Section 2.6.

Symbolic values in memory are used interchangeably with concrete values in
memory (e.g., 7) and in the immediate values within the programs (e.g., 0 in
MOVI 2 0). The function fromInt in the Num class turns concrete values into
symbolic values, making this interchangeability possible. Programs running on
concrete values and producing concrete output can still be run on the model
with the more general types.
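
The data-flow result above can be reproduced without the machine model. This sketch assumes a current GHC, whose Num class uses fromInteger in place of fromInt and also expects abs and signum (left undefined here). The functions pretty and accumulate are hypothetical helpers: accumulate n x plays the role of n passes through the loop body ADD 2 1, starting from the literal 0.

```haskell
data Symbo = Const Int | Var String
           | Plus Symbo Symbo | Minus Symbo Symbo | Times Symbo Symbo
  deriving (Eq, Show)

instance Num Symbo where
  Const x + Const y = Const (x + y)
  Const 0 + y       = y
  x + Const 0       = x
  x + y             = Plus x y
  Const x - Const y = Const (x - y)
  x - Const 0       = x
  x - y             = Minus x y
  Const x * Const y = Const (x * y)
  Const 0 * _       = Const 0
  _ * Const 0       = Const 0
  Const 1 * y       = y
  x * Const 1       = x
  x * y             = Times x y
  fromInteger       = Const . fromInteger   -- integer literals become Const
  abs               = error "abs: not needed in this sketch"
  signum            = error "signum: not needed in this sketch"

-- Print a term without the Var and Const prefixes (no parenthesization).
pretty :: Symbo -> String
pretty (Const n)   = show n
pretty (Var v)     = v
pretty (Plus a b)  = pretty a ++ " + " ++ pretty b
pretty (Minus a b) = pretty a ++ " - " ++ pretty b
pretty (Times a b) = pretty a ++ " * " ++ pretty b

-- Add x to an accumulator n times, starting from the literal 0.
accumulate :: Num a => Int -> a -> a
accumulate n x = iterate (+ x) 0 !! n
```

The same accumulate function runs on both domains: accumulate 7 (11 :: Int) gives 77, while pretty (accumulate 7 (Var "j")) gives "j + j + j + j + j + j + j".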

2.5  Algebraic  simplifications 

The  symbolic  domain  must  have  the  same  behavior  as  the  concrete  domain. 
For  the  case  of  numbers,  there  are  algebraic  laws  that  hold  for  the  concrete 
domain  that  can  be  used  to  simplify  the  output  of  symbolic  simulation.  For 
example,  Var  x  +  Var  x  is  equivalent  to  Const  2  *  Var  x.  These  rules  can  be 
implemented for the symbolic domain by augmenting the instance declaration
for Symbo with cases that describe the algebraic rules. Two algebraic rules
useful for the multiplication program are:

Var x + Var y = if (x == y) then Const 2 * Var x
                else Var x `Plus` Var y

((Const x) `Times` (Var y)) + (Var z) =
    if (y == z) then (Const (x+1)) * (Var y)
    else (Const x `Times` Var y) `Plus` Var z

Using  these  algebraic  simplifications,  the  result  of  the  multiplication  program 
calculating  7 *  j  is  [0,j,7  *  j,y,z]. 

These  algebraic  simplification  rules  perform  the  same  task  as  rewriting  in  a 
theorem  prover. 
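
The effect of the two rules can be checked with a small self-contained variant of the Symbo instance. As before, this is a sketch that assumes a current GHC (fromInteger instead of fromInt, abs and signum left undefined), and accumulate is a hypothetical stand-in for repeated execution of ADD 2 1. The new clauses must appear before the default clause for addition, because clauses are tried in order.

```haskell
data Symbo = Const Int | Var String
           | Plus Symbo Symbo | Minus Symbo Symbo | Times Symbo Symbo
  deriving (Eq, Show)

instance Num Symbo where
  Const x + Const y = Const (x + y)
  Const 0 + y       = y
  x + Const 0       = x
  -- Algebraic rule: v + v = 2 * v
  Var x + Var y
    | x == y        = Const 2 * Var x
    | otherwise     = Plus (Var x) (Var y)
  -- Algebraic rule: (n * v) + v = (n + 1) * v
  Times (Const x) (Var y) + Var z
    | y == z        = Const (x + 1) * Var y
    | otherwise     = Plus (Times (Const x) (Var y)) (Var z)
  x + y             = Plus x y
  Const x - Const y = Const (x - y)
  x - y             = Minus x y
  Const x * Const y = Const (x * y)
  Const 0 * _       = Const 0
  _ * Const 0       = Const 0
  Const 1 * y       = y
  x * Const 1       = x
  x * y             = Times x y
  fromInteger       = Const . fromInteger
  abs               = error "abs: not needed in this sketch"
  signum            = error "signum: not needed in this sketch"

-- Add x to an accumulator n times, starting from the literal 0.
accumulate :: Num a => Int -> a -> a
accumulate n x = iterate (+ x) 0 !! n
```

With these clauses, accumulate 7 (Var "j") evaluates to Times (Const 7) (Var "j"), the term printed as 7 * j.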


2.6  Symbolic  simulation  of  control  flow 

When control values in a program are symbolic, the output of symbolic
simulation captures the multiple execution paths that the program could have
followed. Memory location 0 is a control value in the program prog, because its
value is used to determine whether to take a branch or not. To deal with
symbolic simulation of control values, we have to extend our idea of a state to
include branches representing multiple execution paths. We build this
infrastructure on top of the model.

The  branching  structure  will  have  states  at  its  leaves.  The  following  is  a  data 
type  for  capturing  trees  of  states: 

data State f a =
      CondS a (State f a) (State f a)
    | Term (f a)

The  type  variable  a  describes  the  type  of  the  expression  that  is  used  to  de¬ 
cide  which  branch  to  follow.  In  our  symbolic  simulation,  this  type  variable  is 
instantiated  to  Symbo.  The  type  variable  f  describes  the  form  of  the  leaf  states. 
For  the  simple  machine,  this  will  be  the  type  MachState.  Because  MachState 
is  parameterized  by  the  type  of  data  in  its  memory,  we  use  the  type  expression 
f  a,  providing  the  parameter  Symbo  to  MachState.  The  data  constructor  CondS 
represents  multiple  execution  paths  that  are  conditional  on  the  first  argument 
to  CondS. 

To take a step in this symbolic machine, each leaf state must take a step. This
may result in new branches in the tree. The function step_state is defined over
leaf states and invokes the function execute described in Section 2.1. Using
step_state, we can define a function to take steps over our symbolic state:

step (Term s)      = step_state s
step (CondS a b c) = CondS a (step b) (step c)

Next,  we  need  to  extend  our  symbolic  domain  to  include  the  result  of  checking 
for  equality  over  symbolic  values.  We  add  one  new  symbolic  value: 


data Symbo = ...
           | Equals Symbo Symbo

The definition of equality in the instantiation of Symbo as a member of the
Word type class is now extended to:

instance Word Symbo where
    (Const x) === (Const y) = if (x == y) then (Const 1) else (Const 0)
    a === b                 = Equals a b

Finally, we need the ability to create branches in the state data structure
when conditional jumps are encountered in the program and symbolic data
determines which branch to take. The operator if' used in the semantics of
JUMPZ must be able to sometimes return a terminal state and sometimes return a
branch state. We use a multi-parameter type class to capture the behavior of
if'. A multi-parameter type class allows you to constrain multiple types in a
class instantiation. In the case of if', we parameterize the type of the first
argument (the deciding value) separately from the type of the other arguments.
The result of the function has the same type as the second and third arguments.

class Conditional a b where
    if' :: a -> b -> b -> b

For  working  with  concrete  states,  we  need  an  instantiation  that  uses  the  reg¬ 
ular  if-then-else  for  concrete  values.  Since  we  are  treating  Booleans  as  numbers, 
it  checks  if  its  first  argument  is  1. 

instance Conditional Int (State f Int) where
    if' a b c = if (a == 1) then b else c

When the first argument is symbolic, we use a different definition of if' that
returns a branched state unless the argument is a known constant.

instance Conditional Symbo (State f Symbo) where
    if' (Const 1) b c = b
    if' (Const 0) b c = c
    if' a b c         = CondS a b c

Now without having changed our model, we have the necessary ingredients to
simulate symbolic control values4. For example, if we run the program prog for
20 steps, with all symbolic values in memory, calculating i * j produces the
output found in Fig. 2. In this output, we have included the value of the halt
flag for each state. If i is 0, then the result in memory location 2 is 0 and
the program has stopped. If i is 1, the result is j and the program has
stopped. The last line of the figure is for the case where i > 4, so the result
will be at least 5 * j.

4  The  types  of  the  semantic  functions  change  to  return  a  symbolic  state  but  these  type 
changes  can  be  inferred  by  the  typechecker. 


CondS (i == 0)
  ([i,j,0,y,z],True)
  CondS ((i - 1) == 0)
    ([i - 1,j,j,y,z],True)
    CondS ((i - 2) == 0)
      ([i - 2,j,2 * j,y,z],True)
      CondS ((i - 3) == 0)
        ([i - 3,j,3 * j,y,z],True)
        CondS ((i - 4) == 0)
          ([i - 4,j,4 * j,y,z],True)
          ([i - 5,j,5 * j,y,z],False)


Fig.  2.  Output  of  prog  after  20  steps  with  inputs  “i”  and  “j” 
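
The pieces of Section 2.6 fit together as in the following sketch. It assumes GHC with the MultiParamTypeClasses and FlexibleInstances extensions enabled. To stay short it uses a hypothetical leaf type Leaf (a program counter and a memory) in place of MachState, defines === as a plain function on Symbo rather than as a class method, and passes the leaf-level step function to step as an argument.

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}

data Symbo = Const Int | Var String | Equals Symbo Symbo
  deriving (Eq, Show)

-- Symbolic equality: decided for constants, recorded as a term otherwise.
(===) :: Symbo -> Symbo -> Symbo
Const x === Const y = Const (if x == y then 1 else 0)
a       === b       = Equals a b

-- Trees of states with conditions at the internal nodes.
data State f a = CondS a (State f a) (State f a)
               | Term (f a)

-- Hypothetical leaf state: program counter and memory.
data Leaf a = Leaf Int [a]

class Conditional a b where
  if' :: a -> b -> b -> b

instance Conditional Int (State f Int) where
  if' a b c = if a == 1 then b else c

instance Conditional Symbo (State f Symbo) where
  if' (Const 1) b _ = b
  if' (Const 0) _ c = c
  if' a         b c = CondS a b c

-- Apply a leaf-level step function to every leaf of the tree.
step :: (f a -> State f a) -> State f a -> State f a
step stepLeaf (Term s)      = stepLeaf s
step stepLeaf (CondS c t e) = CondS c (step stepLeaf t) (step stepLeaf e)

-- JUMPZ on the reduced leaf state: jump to l if mem[a] is zero.
jumpz :: Int -> Int -> Leaf Symbo -> State Leaf Symbo
jumpz a l (Leaf pc mem) =
  if' ((mem !! a) === Const 0) (Term (Leaf l mem)) (Term (Leaf (pc + 1) mem))

-- Number of execution paths currently represented.
leaves :: State f a -> Int
leaves (Term _)      = 1
leaves (CondS _ t e) = leaves t + leaves e
```

Applying jumpz 0 5 to a leaf whose cell 0 holds Var "i" yields a CondS node guarded by Equals (Var "i") (Const 0), while a leaf holding Const 0 yields a single Term with the jump taken. Stepping the branched result with the same instruction again produces four leaves, because nothing in this structure relates the second test to the first.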


3  Symbolic  simulation  of  a  superscalar,  out-of-order 
microarchitecture 

We are modifying an existing Hawk model for a Pentium II-like
microarchitecture [CLM98] to use the type class facilities of Haskell for
symbolic simulation. This design is a superscalar, out-of-order, pipelined
architecture with exceptions. We are now able to simulate symbolic data flow
for programs running on the model.

Hawk is a Haskell-based hardware description language for expressing
microarchitecture designs [CLM98,MCL98]. The value of Haskell’s higher-order
functions and polymorphism is illustrated in this Hawk model, although we do
not have space to describe them in this paper.

Hawk  models  usually  process  transactions.  A  transaction  captures  the  state 
of  an  instruction  as  it  progresses  through  the  pipeline.  A  transaction  contains 
the  address  of  the  instruction,  its  opcode,  and  the  addresses  and  values  of  its 
operands.  The  transaction  may  also  contain  a  speculative  PC.  As  the  transaction 
moves  through  the  pipeline,  values  for  input  operands  and  result  operands  get 
filled  in.  The  speculative  PC  is  compared  to  the  calculated  result  of  a  branch 
instruction  to  determine  if  the  pipeline  needs  to  be  flushed. 

The  essential  change  necessary  to  use  type  classes  in  this  design  was  to  modify 
the  values  in  registers  and  memory  to  be  of  a  type  belonging  to  the  type  class  Num 
rather  than  only  integers.  This  modification  also  affects  the  type  of  addresses 
because  calculations  unite  the  address  and  value  space.  Various  Hawk  library 
devices  that  manipulate  transactions  were  changed  to  the  more  general  type. 

The Symbo data type was used to execute a symbolic program calculating x^4 on
this design. Fig. 3 shows our representation of the symbolic DLX [HP96]
program. The comments beside each instruction indicate the address where the
instruction is placed in memory. The output of simulating a Hawk model is a
stream of transactions describing the instructions that have been executed.
Fig. 4 shows the output of the symbolic x^4 program for 48 cycles. The number
on the left is the cycle in which the transaction leaves the pipeline. Because
this processor is superscalar, multiple instructions may leave the pipeline in
one cycle. The program counter follows the cycle number on an output line. The
values of the registers used in computation are given in parentheses. If the
instruction is a branch, a speculative program counter is included in the
transaction.


prog_x_4 =
  [ImmIns (ALUImm (Add Signed)) R3 R0 (Var "x"),  -- 64: R3 <- R0 + x
   ImmIns (ALUImm (Add Signed)) R4 R0 4,          -- 65: R4 <- R0 + 4
   ImmIns (ALUImm (Add Signed)) R6 R0 1,          -- 66: R6 <- R0 + 1
   ImmIns (ALUImm (Add Signed)) R5 R0 0,          -- 67: R5 <- R0 + 0
                                                  -- loop begins here
   RegReg ALU (S GreaterEqual) R1 R5 R4,          -- 68: R1 <- R5 >= R4
   ImmIns BNEZ R0 R1 32,                          -- 69: if (R1/=0) then
                                                  --     goto (70+32/4=78)
   Nop,                                           -- 70: No_op
   RegReg ALU Input1 F2 R6 R0,                    -- 71: F2 <- R6
   RegReg ALU Input1 F3 R3 R0,                    -- 72: F3 <- R3
   RegReg ALU (Mult Signed) F2 F2 F3,             -- 73: F2 <- F2 * F3
   RegReg ALU Input1 R6 F2 R0,                    -- 74: R6 <- F2
   ImmIns (ALUImm (Add Signed)) R5 R5 1,          -- 75: R5 <- R5 + 1
   Jmp J ((-36)),                                 -- 76: goto (77-36/4=68)
                                                  -- end of loop
   Nop,                                           -- 77: No_op
   RegReg ALU (Add Signed) R1 R0 R6               -- 78: R1 <- R0 + R6
  ]


Fig. 3. Symbolic DLX program for x^4

We  are  currently  extending  the  Hawk  library  to  handle  symbolic  control 
paths  as  well.  The  key  to  making  this  work  is  to  have  trees  of  transactions 
flowing  along  the  wires  instead  of  just  simple  transactions.  This  is  similar  to 
how  the  state  in  the  earlier  example  became  trees  of  states.  However,  a  Hawk 
model  is  stream-based  and  therefore,  does  not  have  explicit  access  to  its  state 
like  the  earlier  example  does.  Instead  of  simply  having  a  top-level  branching  of 
state,  the  branching  of  state  must  be  threaded  through  the  entire  model,  just  as 
transactions  are.  This  means  that  most  components  will  need  to  understand  how 
to  handle  trees  of  transactions.  We  are  exploring  how  to  best  use  a  transaction 
type  class  to  define  easily  a  new  instance  of  transactions  that  are  trees. 

Once  these  modifications  to  the  Hawk  library  have  been  made,  all  future  mod¬ 
els  will  be  able  to  simulate  both  concrete  and  symbolic  programs.  The  symbolic 
domain  presented  in  this  paper  is  sufficient  for  many  microarchitectures. 


 1:
 2:
 3:
 4:  256  R3(x) <- R0(0) + x
     260  R4(4) <- R0(0) + 4
 5:  264  R6(1) <- R0(0) + 1
     268  R5(0) <- R0(0) + 0
 6:  272  R1(0) <- R5(0) >= R4(4)
 7:  276  PC(280) <- if R1(0) then PC(280) + 32 else PC(280)
          (SpecPC(256))
 8:
 ...
16:
17:  292  F2(x) <- F2(1) * F3(x)
18:  296  R6(x) <- F2(x)
     300  R5(1) <- R5(0) + 1
     304  PC(272) <- PC(308) + -36 (SpecPC(256))
19:
 ...
28:
29:  292  F2(x * x) <- F2(x) * F3(x)
30:  296  R6(x * x) <- F2(x * x)
     300  R5(2) <- R5(1) + 1
     304  PC(272) <- PC(308) + -36 (SpecPC(272))
     272  R1(0) <- R5(2) >= R4(4)
     276  PC(280) <- if R1(0) then PC(280) + 32 else PC(280)
 ...
42:
43:  292  F2(x * x * x * x) <- F2(x * x * x) * F3(x)
44:  296  R6(x * x * x * x) <- F2(x * x * x * x)
     300  R5(4) <- R5(3) + 1
     304  PC(272) <- PC(308) + -36 (SpecPC(272))
     272  R1(1) <- R5(4) >= R4(4)
     276  PC(312) <- if R1(1) then PC(280) + 32 else PC(280)
          (SpecPC(280))
45:
46:
47:
48:  312  R1(x * x * x * x) <- R0(0) + R6(x * x * x * x)

Fig. 4. Stream of transactions resulting from execution of x^4 program


4  Performance 


In  this  section,  we  consider  the  performance  of  our  “symbolic  simulator”.  We 
used  the  Glasgow  Haskell  Compiler  Version  4.02  [Ghc]  for  running  our  tests. 
Moore  provided  timing  numbers  for  doing  symbolic  simulation  of  the  simple 
machine  in  the  theorem  prover  ACL2  on  a  200  MHz  Sun  Ultra  2  with  512 
MB  [Moo98].  Unfortunately,  we  did  not  have  an  equivalent  platform  available 
and  ran  our  test  cases  on  a  450  MHz  Intel  Pentium  II  with  512  MB  memory. 
Based on SPEC CPU95 integer benchmarks, our platform is roughly two and a half
times faster than Moore’s [SPE].

For concrete simulation, the multiplication program calculating 10 000 * 1000
for 40 007 cycles took 0.53 seconds with ACL2 at best and 0.54 seconds for us.
Here we are comparing Lisp execution to Haskell execution. On a larger concrete
test case taking 400 000 cycles for 100 000 * 1000, we achieved approximately
62 200 simulated instructions per second (IPS).

In ACL2, the multiplication program with symbolic data flow calculating
1000 * j for 4005 cycles took at best 17 seconds with hints and at worst 55
seconds (235 and 72 IPS respectively). Running the same symbolic program took
0.04 seconds for us. When running a much larger test case of 100 000 * j for
400 000 instructions (no branches) we achieved 58 300 IPS.

The multiplication program with symbolic control flow calculating i * j for
2000 cycles took 1.55 seconds, which is approximately 1290 IPS. With branching
symbolic programs, printing time is significant.
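
The throughput figures quoted above appear to be the number of simulated machine steps divided by the elapsed wall-clock seconds, rounded down. The following small calculation, with the hypothetical helper stepsPerSecond, reproduces three of them from the reported cycle counts and times.

```haskell
-- Simulated machine steps per second of wall-clock time, rounded down.
stepsPerSecond :: Int -> Double -> Int
stepsPerSecond steps seconds = floor (fromIntegral steps / seconds)

-- ACL2, symbolic data flow, 4005 cycles: best case (17 s) and worst case (55 s).
acl2Best, acl2Worst :: Int
acl2Best  = stepsPerSecond 4005 17
acl2Worst = stepsPerSecond 4005 55

-- Symbolic control flow in Haskell: 2000 cycles in 1.55 s.
branching :: Int
branching = stepsPerSecond 2000 1.55
```

Here acl2Best is 235, acl2Worst is 72, and branching is 1290, which matches the numbers in the text when the faster ACL2 run is paired with the higher rate.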

ACL2  must  use  its  rewrite  engine  for  symbolic  simulation,  whereas  our  ap¬ 
proach  involves  executing  a  functional  program.  Therefore,  we  do  not  suffer 
a  performance  penalty  for  symbolic  simulation.  Rewriting  requires  searching  a 
database  of  rewrite  rules  and  potentially  following  unused  simplifications  [Moo98]. 

5  Related  Work 

The approach described in this paper is closely related to work on Lava
[BCSS98], another Haskell-based hardware description language. They have
focused mainly on gate-level descriptions, but Lava has also been used for
signal-processing applications. Lava has explored using Haskell features, such
as monads, to provide alternative interpretations of circuit descriptions for
simulation, verification, and generation of code from the same model. A
predominant use of a symbolic circuit interpretation in Lava is to produce
output for theorem provers. Consequently, their symbolic simulation assigns
labels to all subterms and produces a sequence of assertions relating symbolic
inputs to outputs. This is like using pointers to build a branching data
structure. Because pointers are untyped, this representation loses some of the
type information of the expression. Also, a symbolic interpretation must be
applied to all parts of the circuit. Our emphasis has been more on building
symbolic simulation on top of the simulation provided by the execution of a
model as a functional program. In our descriptions of microprocessors, we rely
on the standard meaning of function application to connect components of
circuits. We use type classes extensively to choose between a symbolic
interpretation and a non-symbolic interpretation of an operation. Both
interpretations can be used within the same simulation run. To achieve this
flexibility, we build the branching structure into the symbolic domain and use
type classes to capture the symbolic operations. The branching structure is
threaded through the model. This threading relies on multi-parameter type
classes, a recent extension to Haskell.

Symbols  in  Lisp  can  be  used  for  symbolic  simulation.  For  example,  to  generate 
expressions  for  input  to  the  Stanford  Validity  Checker  [JDB95],  a  simple  HDL 
based  on  Common  Lisp  is  used  [BD94].  In  this  paper,  we  show  how  this  approach 
can  be  done  in  a  strongly-typed,  higher-order,  functional  programming  language. 

Symbolic  simulation  can  be  carried  out  with  uninterpreted  constants  using 
rewriting  techniques  in  a  theorem  prover  (e.g.,  [Joy89,Win90,Moo98,Gre98])  or 
using  more  specialized  techniques  such  as  symbolic  functional  evaluation  [DJJ. 
In  this  form  of  symbolic  simulation,  the  model  is  executed  over  constants  of 
unknown  value  but  the  same  type  as  a  concrete  value.  It  does  not  require  any 
changes  to  the  model.  However,  uninterpreted  constants  are  an  element  of  logic 
and  their  use  requires  the  model  to  be  expressed  in  a  logic.  Simulation  of  a 
logical  specification  requires  special-purpose  infrastructure  such  as  rewriting  or 
a  means  of  partial  evaluation.  Our  symbolic  domain  provides  the  same  effect  as 
uninterpreted  constants  using  a  general-purpose  programming  language. 

Type  classes  provide  the  infrastructure  needed  to  support  the  way  uninter¬ 
preted  constants  have  been  used  in  logical  models  of  microprocessors.  Taking  ad¬ 
vantage  of  polymorphism  in  higher-order  logic,  Joyce  first  used  “representation 
variables”  to  bundle  operations  on  data  [Joy90].  These  operations  parameterize 
both  a  reference  machine  and  a  model  of  the  implementation.  The  verification  ef¬ 
fort  is  valid  for  any  instantiation  of  these  operations.  Having  an  object-oriented 
flavor,  a  type  class  packages  the  functions  of  a  representation  variable  in  one 
location.  It  is  not  necessary  to  parameterize  all  components  of  the  model  by 
type-specific  operations.  We  provide  instantiations  of  the  operations  of  the  type 
class  for  both  concrete  and  symbolic  simulation. 

Graph structures such as BDDs and MDGs represent symbolic formulae. Binary
decision diagrams (BDDs) [Bry86] are a canonical form for propositional logic.
Multiway decision graphs (MDGs) [CZS+94] are a canonical representation of
formulae in many-sorted, first-order logic (including uninterpreted functions).
In both cases, by iterating a next state relation, these representations can be
used to carry out symbolic simulation. BDDs and MDGs are used in decision
procedures because of their canonical form. Our form of symbolic simulation for
higher-order expressions only calculates terms and does not produce a canonical
form. We have not yet characterized the “decidability” of verification efforts
involving the symbolic terms we produce.


6  Limitations 


Our  approach  is  limited  to  models  expressed  as  functions,  although  they  may  be 
either  state-based  as  in  Moore’s  simple  example  or  stream-based  as  in  the  Hawk 
Pentium  II-like  model. 

Compared  to  carrying  out  symbolic  simulation  in  a  logic,  in  our  approach  it 
is  necessary  to  introduce  a  term  structure  for  the  symbolic  domain.  Our  symbols 
differ  from  uninterpreted  constants  in  logic  in  that  a  programming  language  has 
a built-in assumption that elements of user-defined types are distinct. Creating
the symbolic term structure requires care because the symbolic domain must
have the same properties as the concrete domain. Therefore, the usual equality
operation is defined for the symbolic domain only in special cases such as
Var x = Var x. In this paper, we do not address the issue of how one ensures
that the symbolic domain has the same properties as the concrete domain.
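
The restriction on symbolic equality can be illustrated with a minimal sketch. It is written in Python rather than Haskell purely for illustration, and the names `Var`, `Const` and `sym_eq`, as well as the use of `None` for an undecided comparison, are assumptions that do not come from the model described in this paper.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass(frozen=True)
class Var:
    """A symbolic variable standing for an unknown concrete value."""
    name: str


@dataclass(frozen=True)
class Const:
    """A known concrete value embedded in the symbolic domain."""
    value: int


Term = Union[Var, Const]


def sym_eq(a: Term, b: Term) -> Optional[bool]:
    """Return True or False when equality is decidable, else None."""
    if isinstance(a, Const) and isinstance(b, Const):
        return a.value == b.value   # concrete values compare as usual
    if isinstance(a, Var) and isinstance(b, Var) and a.name == b.name:
        return True                 # the special case Var x = Var x
    return None                     # different symbols may or may not be equal
```

Returning "undecided" instead of `False` for two different variables is the point of the sketch: treating distinct symbols as distinct values would give the symbolic domain a property that the concrete domain does not have.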

Our symbolic simulation cannot determine when multiple symbolic decision
points conflict, and therefore it cannot prune impossible execution paths.

Finally,  type  classes  can  make  fixing  type  errors  a  more  difficult  process.  For 
example,  type  errors  are  often  masked  as  missing  class  instantiations. 

7  Conclusion 

The  most  important  conclusion  of  this  work  is  that  facilities  can  be  found  within 
some  existing  programming  languages  to  carry  out  symbolic  simulation  of  mi¬ 
croprocessor  models.  Using  a  programming  language  means  symbolic  simulation 
is  accomplished  by  simply  running  a  program.  The  speed  of  our  method  com¬ 
pares  well  with  using  rewriting  techniques  to  carry  out  symbolic  simulation.  The 
output  of  symbolic  simulation  produced  by  a  model  written  in  a  programming 
language  or  executable  hardware  description  language  can  be  used  as  input  to 
verification  tools. 

Type  classes  in  Haskell  make  it  possible  to  simulate  interchangeably  concrete 
and  symbolic  values  without  changing  the  model.  Type  classes  provide  a  way 
to  exchange  domains  of  values  without  requiring  explicit  parameterization.  The 
class  definition  specifies  the  operations  on  both  the  symbolic  and  concrete  do¬ 
mains.  Algebraic  manipulations  of  values  in  the  symbolic  domain  reduce  the  size 
of  the  symbolic  terms  in  the  output. 

The  symbolic  infrastructure  is  likely  to  be  reusable  for  future  microproces¬ 
sor  models.  Thus,  the  initial  investment  in  setting  up  the  type  classes  can  be 
amortized  over  the  ability  to  simulate  symbolically  many  models. 

We  intend  to  continue  this  work  by  considering  how  this  form  of  symbolic 
simulation  can  be  used  in  verification  techniques.  For  example,  symbolic  trajec¬ 
tory  evaluation  (STE)  [SB95]  is  currently  being  applied  at  the  bit-level  using 
BDDs  as  a  symbolic  representation.  To  apply  STE  at  a  more  abstract  level  a 
means  of  symbolic  simulation  of  abstract  values,  such  as  the  one  we  have  pre¬ 
sented,  is  needed.  We  intend  to  investigate  the  use  of  STE  for  microarchitecture 
verification  leveraging  off  of  this  work  on  symbolic  simulation. 

ABSTRACT

Empirical  software  engineering  often  faces  the  challenge 
of  large  variability  of  results  among  individual  subjects. 
Variability  can  be  reduced  by  using  a  larger  group  of 
subjects, but such a group quickly becomes too expensive.
Another  challenge  is  finding  a  group  of  subjects  that  is 
representative  of  some  relevant  population  of  software 
engineers.  This  paper  explores  the  potential  of  using 
the internet as the medium for software engineering experiments
to address the problems of sample size and
representativeness.

1 INTRODUCTION

Software engineering experiments provide information
that helps improve the software development process. At
the same time, experimental findings in software engineering
are sparse. One of the problems in software
engineering experimentation is the high variability between
the results of individuals, which calls for larger samples.
The  cost  of  experiments  increases  very  rapidly  with  the 
increase  of  the  number  of  subjects  involved.  Finding 
a  large  inexpensive  pool  of  subjects  for  software  engi¬ 
neering  experiments  is  not  easy.  Another  problem  is 
related  to  the  sample  population,  the  subjects  of  the 
experiment.  The  goal  of  an  experiment  is  to  make  con¬ 
clusions  about  some  population  that  is  larger  than  the 
studied sample (the “target population”). The sample population
in software engineering experiments often consists
of students attending a certain class, usually taught
by the experimenter. “Captive subjects” recruited from
a software engineering class usually don’t represent any
reasonable population. The limited number of subjects
and high variance in individual results might be some
of the reasons why software engineering experiments often
cannot detect a statistically significant difference
between the studied phenomena.

2  THE  OPPORTUNITY 

The wide adoption of the internet in recent years offers
us a new way to address the problems of sample
size and representativeness. An experiment can be conducted
via the internet, making it unnecessary for the
subjects to travel to the experiment site. Instead, they
would simply connect to the experiment server, view
the materials, and perform the required tasks. The experiment
server can be made available 24 hours a day
so that the subjects can participate on their own schedule.
An internet-based experiment requires the subjects
to have an internet connection, and therefore the
set of subjects in such an experiment cannot be considered
representative of the entire population. However, since the
set of individuals who have an internet connection includes
most students, it can be considered more
representative than a set of “captive subjects” from a
class.

Internet  experimentation  offers  other  important  advan¬ 
tages  over  the  traditional  classroom-based  setup.  First, 
an  internet-based  experiment  is  easy  to  replicate  inter¬ 
nally  or  externally.  External  replication  of  experiments 
is  important  to  verify  and  validate  the  original  results. 
To  replicate  such  an  experiment,  researchers  would  only 
need  to  copy  the  internet-based  infrastructure  to  their 
own  server  and  inform  the  participants  of  the  server’s 
location.  Second,  internet-based  experiments  can  be 
much easier to study and improve. Even after the
experiment  itself  is  complete,  the  web-based  infrastruc¬ 
ture  can  be  left  available  for  everyone  to  study  and  learn 
from.  Other  researchers  could  walk  through  this  in¬ 
frastructure  to  better  understand  subjects’  experiences 
long  after  the  original  experiment  had  been  completed. 
Third,  the  experiment  server  can  be  programmed  to 
capture  finer  details  of  subjects’  work  process  that  often 
escape  investigation  in  “paper-and-pencil”  experiments. 

An internet-based experiment can be conducted much
faster and at lower cost than a traditional classroom
experiment. It can also eliminate experimenter bias and ensure
that all subjects are treated exactly the same, and it
allows the subjects to remain completely anonymous.

3  THE  PROBLEM 

Before internet-based experiments become a standard
tool of empirical software engineering, research is required
to demonstrate that such experiments can produce
valid results. An internet-based experiment setup removes
some of the threats to the validity of an experiment,
such as experimenter bias and peer pressure, but it can
also introduce new ones. Some of the potential threats
are:
Control: The degree of the experimenter’s control over
subjects on the internet is much less than in a classroom
experiment, and violation of the rules of the study by
subjects would be hard to detect.
Commitment: In an internet experiment subjects may feel
detached and less committed to the study than in a
classroom experiment.
Retention: With the amount and diversity of information
available on the internet, subjects will be tempted
to leave the experiment site and “surf” somewhere else.
Local conditions: Subjects may participate in an internet
experiment from a location that does not allow them
to concentrate.
Technical: Subjects can experience problems
with their computers or internet connections.

There  are  other  factors  that  can  potentially  affect  the 
validity  of  an  internet-based  experiment.  With  a  diverse 
participant  base,  we  can  expect  most  of  these  factors 
to  be  randomly  distributed  so  that  they  will  introduce 
“noise”,  instead  of  a  bias,  into  the  experimental  results. 
A  large  number  of  participants  would  allow  us  to  collect 
enough data to filter out the “noise”.

4  METHODOLOGY 

At this time we don’t have a good understanding
of  all  factors  influencing  the  results  of  “internet”  par¬ 
ticipants.  To  overcome  this  problem,  we  can  aggregate 
these  factors  into  a  single  “internet  factor”  and  start  by 
studying  how  this  factor  affects  the  results  of  internet- 
based  experiments.  We  can  assess  the  internet  factor  in 
two  ways.  We  can  quantify  it  by  investigating  its  effect 
on  the  experimental  results.  To  do  this,  we  will  design 
a  study  that  includes  the  internet  factor  as  an  indepen¬ 
dent  variable.  We  can  also  try  to  better  understand  this 
factor  qualitatively  by  identifying  its  major  parts. 

Quantitative  Analysis 

We  will  start  the  quantitative  analysis  of  the  internet 
factor  by  designing  a  study  that  includes  this  factor  as 
an  independent  variable.  The  study  will  use  two  groups 
of  subjects.  One  group  will  be  recruited  locally  among 
students  of  computer  science  classes  and  computer  pro¬ 
fessionals  (the  “local”  group).  The  other  group  will  be 
recruited on the internet using postings in Usenet newsgroups
and submissions to WWW search engines (the
“internet” group). Both groups will perform the same
set of tasks using the same internet-based infrastructure.
The subjects from the local group will perform
the tasks using computer terminals in the experiment
lab. The subjects from the internet group will perform
the same tasks remotely without making the trip to the
experiment site. The subjects’ task will be to apply different
validation  techniques  to  a  set  of  small  programs.  The 
techniques  selected  are  “functional  testing”  (validation 
without  access  to  source  code)  and  “structural  testing” 
(validation  with  access  to  source  code).  The  validation 
technique  will  become  the  second  independent  variable. 
The  set  of  programs  was  created  by  Kamsties  and  Lott 
[1]  and  later  used  in  a  replication  by  Wood  et  al.  [2]. 

The  difference  in  performance  between  local  and  inter¬ 
net  subjects  using  the  same  validation  technique  will 
allow  us  to  quantify  the  internet  factor.  It  is  possible 
that  the  internet  factor  will  introduce  a  bias,  shifting  the 
performance  of  the  internet  group  up  or  down  for  both 
techniques.  It  can  also  change  the  effect  size.  Analysis 
of  variance  will  be  used  to  determine  the  effect  of  the 
internet  factor  and  its  interaction  with  different  testing 
techniques. 
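
The two possibilities described here, a uniform shift of the internet group and a change in the effect size, correspond to a main effect and an interaction in a two-by-two design. The following Python sketch computes only these two contrasts from cell means; it is not the analysis of variance itself (it produces no F statistics or significance levels). The function name `factorial_effects` and the scores are hypothetical and serve illustration only; they are not experimental data.

```python
from statistics import mean


def factorial_effects(cells):
    """Internet main effect and interaction for a balanced 2x2 design.

    `cells` maps (group, technique) to a list of scores, where group is
    "local" or "internet" and technique is "functional" or "structural".
    """
    m = {key: mean(scores) for key, scores in cells.items()}
    # Internet factor: shift of the internet group, averaged over techniques.
    internet = (mean([m["internet", "functional"], m["internet", "structural"]])
                - mean([m["local", "functional"], m["local", "structural"]]))
    # Technique effect within each group (structural minus functional).
    tech_local = m["local", "structural"] - m["local", "functional"]
    tech_internet = m["internet", "structural"] - m["internet", "functional"]
    # Interaction: how much the technique effect differs between the groups.
    interaction = tech_internet - tech_local
    return {"internet": internet, "interaction": interaction}


# Made-up scores, for illustration only.
example = {
    ("local", "functional"): [4, 6],
    ("local", "structural"): [7, 9],
    ("internet", "functional"): [3, 5],
    ("internet", "structural"): [4, 6],
}
effects = factorial_effects(example)
```

A non-zero `internet` value with a zero `interaction` would indicate a pure bias, while a non-zero `interaction` would indicate that the internet setting changes the measured difference between the two techniques.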

Qualitative  Analysis 

To  perform  qualitative  analysis  of  the  internet  factor 
we  can  observe  the  behavior  of  subjects  from  local  and 
internet  groups  by  studying  the  information  recorded 
by  the  experiment  server.  It  is  possible  that  subjects 
from  the  internet  group  will  be  more  impatient,  less  at¬ 
tentive,  and  less  likely  to  read  instructions.  They  may 
jump  from  page  to  page  more  quickly  and  make  more 
mistakes.  Another  way  to  collect  qualitative  informa¬ 
tion  is  to  ask  all  subjects  to  fill  out  a  questionnaire  at 
the end of the study. The questionnaire will ask about
subjects’ physical conditions, connection speed, possible
interruptions, and other factors that may have affected
their performance, and offer a space to provide feedback.

5  CONCLUSION 

The  internet  presents  an  inviting  opportunity  to  con¬ 
duct  “distributed”  experiments  that  may  address  some 
of  the  most  common  problems  of  empirical  software  en¬ 
gineering:  sample  size  and  representativeness.  However, 
research  is  required  to  demonstrate  feasibility  and  valid¬ 
ity  of  such  experiments  by  studying,  both  quantitatively 
and  qualitatively,  the  factors  that  affect  their  results. 

Abstract. We provide a framework for the specification and verification
of high-performance processors. As an example, we give a high-level
specification  and  correctness  proof  for  a  processor  that  uses  speculation, 
register  renaming,  superscalar  out-of-order  execution,  and  resolution  of 
memory  dependencies.  The  specifications  of  its  three  concurrently  oper¬ 
ating  units  are  very  general  and  can  be  refined  independently,  so  that 
our  proof  covers  a  whole  family  of  microarchitectures.  Abstract  treat¬ 
ment  of  data,  representation  of  on-the-fly  instructions  as  transactions, 
and  a  history  table  containing  the  full  information  of  a  processor’s  run 
are  the  main  features  of  the  proof. 


1  Introduction 

A  variety  of  formal  verification  tools  are  now  in  use  in  various  phases  of  hardware 
design;  [2,  8,  17]  are  but  a  few  notable  examples.  At  the  microarchitectural  level, 
however,  the  real  use  of  verification  is  limited,  mostly  due  to  the  immaturity 
of  the  available  techniques.  Indeed,  proving  the  correctness  of  a  combination 
of  aggressive  strategies  to  resolve  inter-instruction  dependencies  is  extremely 
difficult.  Still,  it  is  an  important  verification  aspect  because  microarchitectural 
defects  can  impact  a  large  fraction  of  the  design  and  so  are  hard  to  fix.  Engineers 
close  to  current  processor  design  teams  inform  us  that  designers  purposefully 
forgo  promising  optimizations  because  they  cannot  guarantee  the  optimizations 
preserve  correctness. 

Following  the  top-down  approach,  we  address  the  question  of  specifying  and 
verifying  processors  at  a  high  level.  On  a  worked  out  example,  we  show  how  to 
abstract  the  specification  as  much  as  possible  in  order  to  clearly  and  concisely 
specify  a  complex  microarchitecture  with  the  following  package  of  features:  spec¬ 
ulation,  register  renaming,  superscalar  out-of-order  execution  with  in-order  re¬ 
tirement,  and  resolution  of  memory  dependencies.  We  present  only  the  essentials 
of  the  microarchitecture,  just  enough  to  make  the  correctness  proof  possible.  The 
lower-level  details  are  left  to  further  refinement. 

Our  example  is  based  on  an  executable  processor  model  expressed  using 
Hawk, a specification language with stream transformer semantics [7, 15]. This
example microarchitecture is close to Intel’s Pentium Pro [10] and AMD’s K6 [20].
It is partitioned into three major units for which we provide independent axiomatic
specifications. We show that the visible output computed by this microarchitecture
is equivalent to that of a simple reference machine implementing
the instruction set architecture. This approach exhibits a very desirable form
of  modularity  where  the  three  units  can  be  independently  refined  further  with¬ 
out  affecting  global  correctness.  Moreover,  since  the  units  are  to  a  large  extent 
underspecified,  our  proof  covers  a  whole  family  of  microarchitectures  that  can 
significantly  vary  in  implementation  details. 

To  write  the  specifications  and  organize  the  proof,  we  use  a  small  number 
of  concepts  and  structures  of  a  general  nature.  For  example,  our  correctness 
criterion  can  be  used  for  any  model  with  in-order  retirement.  Next,  transactions 
(a  formalized  notion  of  partially  computed  instructions)  seem  to  be  just  the 
right  microarchitectural  abstraction  that  provides  uniformity  in  the  description 
of  the  data  path.  Transactions  come  with  a  natural  partial  order  (progress  in 
computation  of  an  instruction)  that  enhances  their  expressiveness  and  can  be 
effectively  used  in  reasoning.  The  proof  itself  revolves  around  a  history  table 
which  contains  all  crucial  information  about  a  single  run  of  a  processor. 

After  a  brief  discussion  of  related  work,  the  rest  of  the  paper  is  organized  by 
sections,  as  follows:  we  specify  a  reference  machine,  introduce  transactions  and 
(informally)  our  processor  model,  describe  the  correctness  criterion,  explain  the 
history  table  and  the  structure  of  the  proof,  and  give  formal  specifications  of  the 
three  processor  components.  The  full  definition  of  the  history  table  and  a  proof 
of  the  correctness  theorem  are  relegated  to  the  Appendix. 

2  Related  Work 

The  complexity  of  verified  processor  models  described  in  the  literature  varies, 
largely  in  connection  with  the  level  of  proof  automation.  Highly  automated  meth¬ 
ods  show  a  promising  trend  of  consistent  increase  of  applicability,  including  im¬ 
pressive  recent  proofs  of  out-of-order  execution  [5,  16].  Still,  the  models  verified 
by  these  methods  are  rather  limited.  This  paper  belongs  to  the  other  end  of  the 
spectrum:  our  processor  model  is  one  of  the  most  complex,  but  at  the  price  of 
having been specified in a rather unconstrained mathematical style, and verified
by  a  pencil-and-paper  proof.  The  same  can  be  said  of  the  work  of  Arvind  and 
Shen  [4],  whose  appealing  processor  model  is  defined  as  a  term-rewriting  sys¬ 
tem.  While  our  specifications  allow  refinement  in  the  most  obvious  sense,  it  is 
not  clear  how  the  correctness  result  of  [4]  that  relies  on  being  able  to  apply  the 
rewrite  rules  in  any  order  would  translate  to  a  lower-level  implementation  that 
lacks  that  property. 

With  Pnueli  and  Arons  [18]  we  share  the  insistence  on  maximal  abstraction 
and  modularity  stemming  from  specifying  the  processor  as  a  simple  composition 
of  concurrent  subsystems.  There  is  also  some  similarity  in  the  correctness  crite¬ 
rion,  based  on  the  idea  of  refinement.  Their  model,  however,  assumes  a  restricted 
instruction  set,  without  branches  and  memory  instructions. 

The  correctness  criterion  adopted  in  most  processor  verification  papers  is  the 
“commutative  diagram”  condition  of  Burch  and  Dill  [6],  or  some  version  thereof 
(cf. [4, 12, 14, 19]). Along with [18], we avoid dealing with explicit synchronization
and abstraction functions that match the states of the verified processor
with the states of the reference machine. Instead, our criterion requires that the
two sequences of retired instructions arising from running the same program on
the two machines are equivalent.

Dealing  with  memory  instructions  combined  with  out-of-order  execution  has 
only  recently  come  into  the  scope  of  processor  verification  efforts;  cf.  [4,  12,  19]. 
Our  execution  unit  allows  multiple  refinements  with  arbitrarily  sophisticated 
treatment  of  memory  operations  (load  bypassing,  for  example). 

A  remarkably  detailed  model,  including  a  treatment  of  exceptions,  is  verified 
by  Sawada  and  Hunt  [19]  using  a  methodology  which  has  many  similarities  to  our 
work. The key structure they use, the Microarchitectural Execution Trace Table,
contains  entries  that  are  much  like  our  transactions.  This  table  represents  the 
current  computational  state  of  the  processor  like  a  row  of  our  history  table  does. 
A  global  invariant  relates  the  table  with  the  corresponding  microarchitectural 
state.  Since  it  references  most  of  the  state  elements,  this  invariant  presents  a 
difficult  proof  obligation,  which  unfortunately  is  only  briefly  discussed  in  [19]. 

Our  paper  promotes  hierarchical  verification  by  providing  a  very  general 
and  non-deterministic  model  and  a  straightforward  reduction  to  verification  of 
components.  At  this  level,  the  assume-guarantee  style  takes  a  simple  form:  all 
that  the  components  assume  of  the  environment  are  type-correct  values  on  their 
input  wires;  cf.  [11]. 


3  Standard  machine  (ISA) 


Our reference model is an abstract standard machine, defined as a state machine
whose  states  consist  of  values  for  the  program  counter,  register  file  and  memory. 
Most  of  the  common  instruction  set  architectures  are  instances  of  it  when  we 
ignore  the  treatment  of  external  exceptions. 

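Definition 1. Given a state (pc, rf, mem), the standard machine (executing a
fixed program pgm) makes a transition to the state (pc', rf', mem') defined by the
following set of equalities.

\begin{align*}
I &= \mathit{pgm}(\mathit{pc}) \\
(\mathit{opcode}, \mathit{rSources}, \mathit{rDest}) &= \mathit{decode}(I) \\
\mathit{rOps} &= \mathit{rf}(\mathit{rSources}) \\
(\mathit{mSource}, \mathit{mDest}) &= \mathit{getAddr}(\mathit{opcode}, \mathit{rOps}) \\
\mathit{mOp} &= \mathit{mem}(\mathit{mSource}) \\
(\mathit{pc}', \mathit{rRes}, \mathit{mRes}) &= \mathit{compute}(\mathit{pc}, \mathit{opcode}, \mathit{rOps}, \mathit{mOp}) \\
\mathit{rf}' &= \begin{cases} \mathit{rf}[\mathit{rDest} \mapsto \mathit{rRes}] & \text{if } \mathit{rDest} \in \mathit{Reg} \\ \mathit{rf} & \text{if } \mathit{rDest} = () \end{cases} \\
\mathit{mem}' &= \begin{cases} \mathit{mem}[\mathit{mDest} \mapsto \mathit{mRes}] & \text{if } \mathit{mDest} \in \mathit{Addr} \\ \mathit{mem} & \text{if } \mathit{mDest} = () \end{cases}
\end{align*}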

The function decode extracts the opcode, source registers and the destination
register from an instruction. The function getAddr computes the addresses
mSource for loads and mDest for stores. Finally, the results of compute are the
new value for the program counter and the values to be written back to the
register file or memory.

The  standard  machine  is  totally  data-insensitive.  It  uses  abstract  basic  types 
IAddr,  Instr,  Opcode,  Value,  Reg  and  Addr,  and  the  rest  is  typed  as  follows: 

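\begin{align*}
&\mathit{pc} : \mathit{IAddr}, \quad \mathit{pgm} : \mathit{IAddr} \to \mathit{Instr}, \quad \mathit{rf} : \mathit{Reg} \to \mathit{Value}, \quad \mathit{mem} : \mathit{Addr} \to \mathit{Value} \\
&\mathit{decode} : \mathit{Instr} \to \mathit{Opcode}, \mathit{RegSeq}, \mathit{Reg}^{()} \\
&\mathit{getAddr} : \mathit{Opcode}, \mathit{ValueSeq} \to \mathit{Addr}^{()}, \mathit{Addr}^{()} \\
&\mathit{compute} : \mathit{IAddr}, \mathit{Opcode}, \mathit{ValueSeq}, \mathit{Value}^{()} \to \mathit{IAddr}, \mathit{Value}^{()}, \mathit{Value}^{()}
\end{align*}

where we follow the convention to write product types using commas and function
types using arrows. The notation $\mathit{Type}^{()}$ is a shorthand for the sum type
$\mathit{Type} + \{()\}$, where the element () indicates a value that does not need computation.
For example, the first component of the result of getAddr is () unless the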
first  argument  is  the  opcode  of  a  load  instruction.  Note  that  our  definition  allows 
a  single  instruction  to  have  the  combined  behavior  of  a  branch,  alu-instruction, 
load  and  store,  if  desired.  Particular  instructions  may  of  course  choose  to  only 
implement  a  subset  of  this  functionality. 


4  An  example  processor 


When  reasoning  about  the  execution  process  of  complex  processors  one  nor¬ 
mally  thinks  of  instructions  as  entities  that  come  into  being  at  a  certain  cycle 
and  evolve  thereafter.  Transactions  formalize  this  notion  of  partially  computed 
instructions.  Informally,  a  transaction  is  a  package  of  information  which  (di¬ 
rectly  or  indirectly)  contains  the  identity  of  the  unique  (static)  instruction  it  is 
associated  with  plus  various  data  extracted  from  the  processor’s  state  that  are 
relevant  for  the  execution  of  that  instruction. 

Guided  by  the  standard  machine  specification,  we  define  a  standard  transac¬ 
tion  as  a  record  with  the  following  eleven  fields: 


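\begin{align*}
&\mathit{instr} : \mathit{Instr} \\
&\mathit{opcode} : \mathit{Opcode} \\
&\mathit{rSources} : \mathit{RegSeq} \\
&\mathit{rOps} : \mathit{ValueSeq} \\
&\mathit{rDest} : \mathit{Reg}^{()} \\
&\mathit{mSource}, \mathit{mDest} : \mathit{Addr}^{()} \\
&\mathit{npc} : \mathit{IAddr} \\
&\mathit{mOp}, \mathit{rRes}, \mathit{mRes} : \mathit{Value}^{()}
\end{align*}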


We assume that all our basic types contain a value $\bot$, indicating an uncomputed
value. We will also use the notation $\mathit{rOp}_i(T)$ for the $i$th member of the
sequence $\mathit{rOps}(T)$. The functions decode, getAddr and compute treat $\bot$ as an
argument in a lazy fashion: a component of their result is $\bot$ only if some crucial
arguments needed for computation of that result are $\bot$.

A  natural  idea,  introduced  in  [3]  and  paradigmatic  for  the  Hawk  specification 
language  [15],  is  to  use  transactions  as  a  unifying  concept  in  microarchitectural 
specifications.  Transactions  are  passed  along  wires  and  manipulated  by  processor 
components. In addition to the above standard fields, any specific microarchitecture
adds fields appropriate for the description of its execution algorithm. Our
example processor adds five new fields: the instruction address addr, the speculative
next program counter spc, the name (alias) name, the register providers
rProvs and the most recent store mrSt:

addr, spc : IAddr
name : Name
rProvs : NameOptSeq
mrSt : NameOpt

The fields rProvs and mrSt will record dependencies among instructions. Here
NameOpt = Name + {NONE} is the type of an optional name field, where
NONE serves to indicate the lack of dependency.


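[Figure 1: block diagram of the FETCH UNIT, the ORDERING UNIT and the EXECUTION UNIT, connected by the wires rpc, fetched, dequeued, prepared, computed, flush and writemem.]

State components:
pc, xpc : $\mathit{IAddr}$
rf : $\mathit{Reg} \to \mathit{Value}$
mem : $\mathit{Addr} \to \mathit{Value}$
mature, young : $\mathit{TransSeq}$
executing, lingering : $\mathit{TransSet}$

Wires:
rpc : $\mathit{IAddr}^{()}$
fetched, dequeued, prepared : $\mathit{TransSeq}$
computed : $\mathit{TransSet}$
flush, writemem : $\mathit{Bool}$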

Fig.  1.  Top-level  specification  with  the  types  of  wires  (right)  and  state  components 
(left).  Thick  wires  represent  transaction  sets  or  sequences.  At  each  cycle,  units  update 
their  state  and  output  wires  depending  on  the  values  on  their  input  wires  and  state 
elements  at  the  previous  cycle. 


The  processor  consists  of  three  major  units  and  seven  wires  as  depicted  in 
Fig.  1.  The  fetch  unit  provides  multiple  instructions  at  each  cycle.  This  unit 
outputs  along  the  fetched  wire  transactions  with  filled  in  fields  instr,  addr  and 
spc.  The  fetching  of  instructions  begins  at  the  address  pc  if  the  current  value 
of  rpc  (requested  program  counter)  is  ();  otherwise  rpc  is  used.  The  fetching 
proceeds  by  unconstrained  speculation. 

The  ordering  unit  maintains  the  sequential  programming  model  of  the  ISA  by 
using  a  queue  made  by  concatenating  the  sequences  mature  and  young  (Fig.  2).  It 
takes  a  prefix  of  the  sequence  fetched  to  form  a  transaction  sequence  enqueued 
to  be  added  to  the  back  of  the  queue.  The  transactions  of  fetched  that  do  not 
belong to the chosen prefix are discarded. Each transaction added to the queue
gets  its  name  field  filled  in,  unique  in  the  queue.  The  mature  part  of  the  queue 
corresponds  to  transactions  already  sent  to  the  execution  unit.  Transactions 
in  prepared  are  taken  from  the  beginning  of  the  young  part  of  the  queue  and 
possibly  also  from  enqueued;  they  all  have  their  rOps,  rProvs  and  mrSt  fields  filled 
in.  The  elements  of  rOps  obtain  values  from  rf  when  there  is  no  dependency 
on  previous  transactions;  if  there  are  dependencies,  they  are  recorded  in  the 
elements  of  rProvs,  which  contain  the  names  of  the  transactions  that  will  provide 
the  appropriate  values  when  computed.  The  field  mrSt  contains  the  name  of 
the  last  preceding  store  in  the  queue;  it  is  used  only  by  loads  and  stores  for 
future  resolution  of  dependencies  among  them.  The  mature  part  of  the  queue  is 
updated  by  transactions  arriving  along  the  computed  wire,  then  a  prefix  of  the 
resulting  sequence  consisting  entirely  of  complete  transactions  is  retired,  that  is, 
sent along the dequeued wire while updating rf. When a retired transaction is
a  mispredicting  branch,  then  the  queue  is  emptied,  the  Boolean  wire  flush  is 
asserted  and  rpc  set  equal  to  the  address  of  the  last  retired  transaction.  The 
wire  rpc  is  also  given  a  non-trivial  value  when  not  all  fetched  transactions  are 
enqueued.  In  this  case  the  rpc  is  set  to  the  spc  of  the  last  enqueued  transaction. 


[Figure 2: two diagrams showing how the sequences fetched, mature, young, enqueued, dequeued, prepared, mature′ and young′ line up in a transition of the ordering unit.]

Fig. 2. Two possible scenarios for the relationship between transaction sequences involved
in a transition of the ordering unit. The inputs are fetched, mature and young,
and the outputs are dequeued, prepared, mature′ and young′. The sequences are
aligned so that if two transactions are on the same vertical line, then the higher one is
less than or equal to the lower (in the progress ordering defined below).


The  execution  unit  is  an  out-of-order  component  that  computes  the  results 
rRes  and  mRes  of  transactions  contained  within  it  and  determines  which  of  these 
transactions  are  mispredicting  (by  computing  npc  for  each  and  comparing  it  with 
spc). It may also execute a memory store if the value on the wire writemem indicates
that it is the right time to do so. A number of completed transactions are sent
out  along  the  computed  wire,  while  placing  them  in  the  set  lingering,  where 
each  of  them  will  remain  intact  until  the  moment  when  an  equally  named  trans¬ 
action  comes  along  the  prepared  wire  and  takes  its  place.  When  a  transaction 
is  sent  to  computed  (or  sooner),  the  values  in  its  result  fields  are  forwarded  to 
all  other  transactions  in  executing.  There  are  no  requirements  on  the  number 
of  transactions  executed  at  each  cycle  and  the  only  requirement  on  the  order  of 
their  execution  is  that  the  data-flow  order  is  respected. 



5  Correctness  criterion 


One  can  slightly  extend  the  definition  of  the  standard  machine  so  that  at  each 
cycle  it  outputs  a  complete  transaction  (corresponding  to  the  instruction  com¬ 
pleted  at  that  cycle).  A  run  of  the  standard  machine  then  defines  a  sequence  of 
“retired”  transactions  from  which  the  corresponding  sequence  of  states  of  the 
standard  machine  can  easily  be  reconstructed. 

A  transition  of  a  complex  processor  cannot,  in  general,  be  associated  with  a 
unique  transaction,  but  with  a  sequence,  possibly  empty,  of  transactions  retired 
on that transition. So, suppose P is a processor and denote by $\rho_n$ the sequence
of transactions retired by P on its nth cycle. Concatenating these sequences we
obtain $\rho_\infty = \rho_1 \rho_2 \cdots$. Replacing every transaction in $\rho_\infty$ with the corresponding
standard transaction (which amounts to ignoring its “non-standard” fields), we
obtain a sequence of standard transactions $\rho_\infty^{std}$, which, if P does implement the
standard machine, should be identical to the appropriate execution sequence of
the standard machine. This gives us the following correctness criterion.

Definition 2. A processor P is correct with respect to the standard machine if
for any given program pgm and a state $a_0$ of the standard machine, there exists
an initial state of P such that the execution of pgm on P produces a sequence of
retired transactions $p_\infty$ with the associated sequence $p_\infty^{st}$ equal to the execution
sequence defined by the program pgm and the initial state $a_0$.
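
The construction behind this criterion can be tried out on finite prefixes of a run. The following is a minimal sketch, assuming that transactions are Python dictionaries; the tuple STANDARD_FIELDS is a hypothetical choice of "standard" fields made only for illustration and is not the record type of the paper.

```python
from itertools import chain

# Hypothetical split of fields: the names follow the text, but which
# fields count as "standard" is an assumption made for this illustration.
STANDARD_FIELDS = ("addr", "instr", "npc", "rRes", "mRes")

def to_standard(t):
    """Forget the non-standard fields of a transaction (a dict)."""
    return {f: t[f] for f in STANDARD_FIELDS}

def retired_prefix(per_cycle):
    """Concatenate the sequences retired on cycles 1..n and project them."""
    return [to_standard(t) for t in chain.from_iterable(per_cycle)]

def agrees_with(per_cycle, reference):
    """Compare a finite retired prefix with a prefix of the reference run."""
    got = retired_prefix(per_cycle)
    return got == reference[:len(got)]
```

Cycles on which nothing is retired contribute empty sequences, so they do not affect the comparison.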

The  notion  of  the  execution  sequence  is  made  precise  below,  after  a  brief 
elaboration  of  the  type  of  transactions. 

5.1  The  progress  ordering  of  transactions 

We define the progress ordering $\preceq$ on the set of transactions so that $T_1 \preceq T_2$
will mean that $T_2$ is a computationally more advanced (“closer to retirement”)
version of $T_1$. The relation $\preceq$ is the product of 16 partial orders (all denoted
$\preceq$), one for each record component. These component orders are defined as
follows. For each basic type (including Name), we make $\bot$ the smallest element
and all other elements, including (), incomparable. In NameOpt, NONE is the
largest element. Finally, two sequences are comparable if and only if they have
the same length and the elements of one of them are all less than or equal to the
corresponding elements of the other.
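
The component orders are simple enough to write down directly. The following is a minimal sketch, assuming that the undefined value is represented by Python's None and that a transaction is a dictionary of fields; the function names are ours and the dictionary `orders` (one comparison per field) is a hypothetical stand-in for the 16 record components.

```python
BOTTOM = None          # stands for the undefined value (bottom)
NONE = "NONE"          # top element of NameOpt

def leq_basic(x, y):
    """Flat order: bottom is below everything, other elements only below themselves."""
    return x is BOTTOM or x == y

def leq_nameopt(x, y):
    """Like the flat order, with NONE added as the largest element."""
    return x is BOTTOM or x == y or y == NONE

def leq_seq(xs, ys, leq=leq_basic):
    """Sequences compare pointwise and only when they have equal length."""
    return len(xs) == len(ys) and all(leq(x, y) for x, y in zip(xs, ys))

def leq_trans(t, u, orders):
    """Product order: every field of t is below the same field of u."""
    return all(orders[f](t[f], u[f]) for f in orders)
```

Note that the unit value () is an ordinary element here, so it is comparable only with itself and with the undefined value.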

The  partial  order  just  introduced  allows  us  to  define  the  notion  of  intrinsic 
consistency  of  transactions.  Intuitively,  a  transaction  is  consistent  if  the  contents 
of  its  fields  do  not  contradict  any  of  the  equations  occurring  in  the  definition 
of  the  standard  machine.  Of  these  equations,  the  ones  that  do  not  involve  the 
components  of  the  machine  state  (program  counter,  register  file  and  memory) 
give  rise  to  consistency  criteria: 

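$$
\begin{aligned}
(\mathrm{opcode}(T), \mathrm{rSources}(T), \mathrm{rDest}(T)) &\preceq \mathrm{decode}(\mathrm{instr}(T)) \\
(\mathrm{mSource}(T), \mathrm{mDest}(T)) &\preceq \mathrm{getAddr}(\mathrm{opcode}(T), \mathrm{rOps}(T)) \\
(\mathrm{npc}(T), \mathrm{rRes}(T), \mathrm{mRes}(T)) &\preceq \mathrm{compute}(\mathrm{addr}(T), \mathrm{opcode}(T), \mathrm{instr}(T), \mathrm{rOps}(T), \mathrm{mOp}(T))
\end{aligned}
$$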

By  definition,  a  transaction  is  consistent  if  its  fields  satisfy  these  inequalities.  We 
define  Trans  to  be  the  set  of  all  consistent  transactions.  Note  that  consistency 
of  a  transaction  depends  entirely  on  the  contents  of  its  “standard”  fields  and 
that all strictly increasing chains in the poset (Trans, $\preceq$) are of finite length.

Maximal transactions with respect to the ordering $\preceq$ will be called complete;
a transaction is complete if none of its fields is $\bot$, and mrSt and all component
fields of rProvs are NONE.


5.2  Execution  sequences 

For  every  transition  of  the  standard  machine  there  is  an  associated  complete 
standard  transaction.  To  define  it,  just  use  the  left-hand  sides  of  the  equations 
in  Definition  1.  Thus,  together  with  every  run  of  the  standard  machine,  one 
can consider the corresponding transaction sequence $(T_1, T_2, \ldots)$, where $T_i$ corresponds
to the ith transition. Characterizing properties of such sequences are
collected  in  Definition  3  below. 

If $\tau$ is a (finite or infinite) sequence of transactions or standard transactions
and T a transaction in $\tau$, we define the ith register provider of T to be the
transaction U of $\tau$ which precedes T and has the property that $\mathrm{rSource}_i(T) = \mathrm{rDest}(U)$,
while $\mathrm{rSource}_i(T) \ne \mathrm{rDest}(V)$ for all transactions V between U and T.
Similarly, we define U to be the store provider of T if T is a load and U is the
last store among the transactions that precede T in $\tau$ and satisfy
$\mathrm{mSource}(T) = \mathrm{mDest}(U)$.
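
Both kinds of provider are found by scanning backwards through the sequence. The following is a minimal sketch, assuming that a sequence is a Python list of dictionaries, that positions are 0-based, and that the unit value () in the mSource or mDest field marks a transaction that is not a load or not a store; these encodings are assumptions of the sketch.

```python
def register_provider(seq, m, i):
    """Index of the i-th register provider of seq[m], or None if there is none.

    Scans backwards from seq[m] and returns the closest earlier transaction
    whose rDest equals the i-th register source of seq[m].
    """
    src = seq[m]["rSources"][i]
    for k in range(m - 1, -1, -1):
        if seq[k]["rDest"] == src:
            return k
    return None

def store_provider(seq, m):
    """Index of the last earlier store writing the address loaded by seq[m]."""
    t = seq[m]
    if t["mSource"] == ():          # () marks "not a load" in this sketch
        return None
    for k in range(m - 1, -1, -1):
        if seq[k]["mDest"] == t["mSource"]:
            return k
    return None
```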


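Definition 3. An infinite sequence $\tau = (T_1, T_2, \ldots)$ is an execution sequence
corresponding to the program pgm and the initial state $(pc_{init}, rf_{init}, mem_{init})$ if
every $T_m$ is a complete transaction and

$$
\mathrm{instr}(T_m) = \begin{cases}
\mathrm{pgm}(pc_{init}) & \text{if } m = 0 \\
\mathrm{pgm}(\mathrm{npc}(T_{m-1})) & \text{if } m > 0
\end{cases}
$$

$$
\mathrm{rOp}_i(T_m) = \begin{cases}
\mathrm{rRes}(T_k) & \text{if } T_k \text{ is the ith register provider for } T_m \text{ in } \tau \\
rf_{init}(\mathrm{rSource}_i(T_m)) & \text{if } T_m \text{ does not have an ith provider in } \tau
\end{cases}
$$

$$
\mathrm{mOp}(T_m) = \begin{cases}
\mathrm{mRes}(T_k) & \text{if } T_k \text{ is the store provider for } T_m \text{ in } \tau \\
mem_{init}(\mathrm{mSource}(T_m)) & \text{if } T_m \text{ does not have a store provider in } \tau
\end{cases}
$$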
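
The three equations of Definition 3 can be checked mechanically on a finite prefix of a sequence. The following is a minimal sketch under several assumptions of ours: transactions are dictionaries, positions are 0-based (position 0 plays the role of the first transaction), pgm, rf_init and mem_init are dictionaries, and () in the mSource field marks a transaction that is not a load. Completeness of the transactions is not checked.

```python
def last_before(seq, m, pred):
    """Index of the last transaction before position m satisfying pred, or None."""
    return next((k for k in range(m - 1, -1, -1) if pred(seq[k])), None)

def is_execution_prefix(seq, pgm, pc_init, rf_init, mem_init):
    """Check the three equations of Definition 3 on a finite prefix."""
    for m, t in enumerate(seq):
        # Instruction equation: follow the chain of next program counters.
        pc = pc_init if m == 0 else seq[m - 1]["npc"]
        if t["instr"] != pgm[pc]:
            return False
        # Register operands come from the provider or from the initial file.
        for i, src in enumerate(t["rSources"]):
            k = last_before(seq, m, lambda u: u["rDest"] == src)
            want = rf_init[src] if k is None else seq[k]["rRes"]
            if t["rOps"][i] != want:
                return False
        # Memory operand of a load comes from the store provider or initial memory.
        if t["mSource"] != ():
            k = last_before(seq, m, lambda u: u["mDest"] == t["mSource"])
            want = mem_init[t["mSource"]] if k is None else seq[k]["mRes"]
            if t["mOp"] != want:
                return False
    return True
```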


6  History  Table  (Structuring  the  proof) 

Reasoning  about  the  execution  of  processors  can  be  conveniently  organized 
around  a  history  table.  Two  simple  observations  are  behind  its  definition.  First, 
if $I_1, I_2, \ldots$ is the sequence of instructions considered by the processor during
a run, then each transaction T found anywhere in the processor at any time is
associated with a unique fetched instruction $I_j$; we say that j is the ordinal of T.
The second observation is that there are only finitely many essentially different
execution patterns for an instruction and that one can define a finite transition
diagram describing those patterns. Each node of this transaction flow diagram
$\Gamma$ corresponds to a distinguished “pipeline stage” and will be called a status.

A history table is defined for every run of the processor. At the nth row
and the ith column of the table one finds a pair $H_n^i = (T, X)$, where T is the
transaction that represents the state of computation of $I_i$ at the nth cycle and
X is the status of that computation. Formally, $H_n^i$ is defined in terms of the set
of transactions with ordinal i which are present in the processor at the nth cycle,
and the values of “control” variables at that cycle; normally, T is the maximal of
those transactions and the status X corresponds to the set of locations in which
they are found.


Fig. 3. Transaction flow diagram $\Gamma$. The transitions to squashed occur only when
flush = TRUE.


For our example processor, $\Gamma$ is given in Fig. 3. The top row represents
the  execution  patterns  of  successfully  completed  instructions.  Looping  at  young 
means  waiting  to  be  sent  to  the  execution  unit;  the  loops  at  executing  and 
ripe  have  similar  meaning.  The  status  ripe  corresponds  to  the  set  of  complete 
transactions  contained  in  mature.  The  final  statuses  ignored  and  squashed  are 
for  transactions  aborted  because  of  the  overflow  in  the  ordering  unit  (inability 
to  enqueue  all  fetched  transactions)  and  misprediction,  respectively. 

The rows of the history table are finite; the length of the nth row is equal
to the total number of fetched instructions in the first n cycles. All columns
stabilize: for each i, we have $H_{n+1}^i = H_n^i$ for all large n. This follows since both
Trans and $\Gamma$ are posets in which strictly increasing chains are finite. We define
the limit row $H_\infty$ as the sequence of the limit values of columns: $H_\infty^i = \lim_n H_n^i$.

For any $n \le \infty$, denote by $\tau_n$ the sequence of transactions occurring in
the nth row $H_n$ of H. Let also $\tau_n^D$ denote the sequence consisting of only those
transactions occurring in $H_n$ whose corresponding status component is dequeued.
The correctness of the processor can then be restated as follows.

Theorem. $\tau_\infty^D$ is an execution sequence.

In  view  of  Definition  3,  this  presents  us  with  four  proof  obligations. 
[Figure 4 legend: fetched, active, dequeued, squashed, ignored]

Fig.  4.  Seven  consecutive  rows  in  the  middle  of  a  history  table.  The  second  depicts  a 
cycle  when  only  part  of  the  fetched  transactions  is  enqueued.  The  first  misprediction  is 
seen  in  the  fourth  row;  transactions  fetched  at  this  cycle  are  ignored  at  the  next,  when 
also,  due  to  the  misprediction,  the  fetching  unit  was  unable  to  output.  (“Active”  stands 
for  statuses  that  are  neither  initial  nor  final  and  reflects  the  queue  in  the  ordering  unit.) 


Proposition 1. The sequence $\tau_\infty^D$ is infinite.

Proposition 2. If U and T are two consecutive elements of $\tau_\infty^D$, then npc(U) =
addr(T). Also, the value of the addr field of the first transaction of $\tau_\infty^D$ is $pc_{init}$.

Proposition 3. Let T be a transaction in $\tau_\infty^D$. If U is the rth register provider
of T in $\tau_\infty^D$, then $\mathrm{rOp}_r(T) = \mathrm{rRes}(U)$, and if T does not have an rth provider in
$\tau_\infty^D$, then $\mathrm{rOp}_r(T) = rf_{init}(\mathrm{rSource}_r(T))$.

Proposition 4. Let T be a transaction in $\tau_\infty^D$. If U is the store provider of T
in $\tau_\infty^D$, then mOp(T) = mRes(U), and if T does not have a store provider in $\tau_\infty^D$,
then $\mathrm{mOp}(T) = mem_{init}(\mathrm{mSource}(T))$.

The  proof  of  Proposition  1  uses  the  liveness  conditions  of  components.  The 
major  results  one  needs  to  establish  are  the  infinity  of  the  sequences  of  fetched 
and enqueued transactions, and the absence of livelock, expressed as the statement
that all locations in $H_\infty$ are final. Proving the remaining three propositions
involves  a  rather  straightforward  but  tedious  chasing  around  the  history  table. 

7  Formal  specification 

Staying  close  to  the  Hawk  specification  style,  we  model  processors  and  their 
components  as  state  machines,  which  use  sets  of  input  wires,  output  wires,  and 
states,  each  wire  and  each  piece  of  state  having  a  prescribed  type.  The  machine 
is  then  defined  by  a  function  whose  arguments  are  the  values  for  input  wires 
and  states,  and  whose  results  are  values  for  the  output  wires  and  states  in  the 
next  clock  cycle.  Consequently,  the  machine  acts  as  a  signal  transformer:  for  any 
given  signals  (infinite  sequences)  of  inputs  and  initial  values  of  states,  it  produces 
uniquely  determined  signals  of  outputs. 
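
The passage from a transition function to a signal transformer can be illustrated outside Hawk. The following is a minimal sketch in Python, assuming that signals are modelled as (possibly infinite) iterators; the machine `accumulator` is a hypothetical toy example and not one of the processor components.

```python
from itertools import islice, count

def run(step, init_state, inputs):
    """Turn a transition function into a signal transformer.

    step(state, inp) returns (out, next_state); the generator yields the
    output signal lazily, so inputs may be an infinite iterator.
    """
    state = init_state
    for inp in inputs:
        out, state = step(state, inp)
        yield out

def accumulator(state, inp):
    """Toy machine: outputs the running sum of its input wire."""
    total = state + inp
    return total, total

# First five outputs for the input signal 1, 2, 3, ...
first_five = list(islice(run(accumulator, 0, count(1)), 5))
```

For a fixed input signal and initial state the output signal is uniquely determined, which is the property the text relies on.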

An  axiomatic  specification  of  a  state  machine  could  consist  of  a  list  of  its 
input,  output  and  state  variables,  an  initial  condition,  an  invariance  condition, 
and  a  liveness  condition.  Without  making  these  notions  precise,  we  note  that  an 
invariant is a propositional formula written in terms of input variables, output
variables,  state  variables  and  primed  state  variables,  and  a  liveness  condition  is 
a  property  of  signals  expressible  by  a  suitable  formula  in  temporal  logic. 

Again  without  going  into  technicalities,  state  machines  can  be  composed  by 
identifying  each  output  wire  of  the  constituent  machines  with  some  (zero  or 
more)  input  wires.  At  the  level  of  signals,  which  is  how  it  is  done  in  Hawk, 
composition  amounts  to  writing  a  system  of  equations,  each  corresponding  to  a 
component  machine. 

The  input,  output  and  state  variables  of  the  three  components  of  our  proces¬ 
sor  can  be  read  off  from  Fig.  1,  which  also  tells  how  the  wires  are  joined  to  give  a 
specification  of  the  processor  as  a  composition  of  its  components.  The  formulas 
for  specifications  of  components  are  given  below,  after  introducing  notational 
conventions. 

The values pgm, $pc_{init}$, $rf_{init}$ and $mem_{init}$ are constants.

We restrict the type TransSeq to “uniquely named” sequences: if two transactions
in a sequence have names x and y, neither of which is $\bot$, then $x \ne y$.
The concatenation of sequences $\alpha$ and $\beta$ is denoted $\alpha$ ^ $\beta$. A partial order on
the set of transaction sequences is defined by $\alpha \preceq \beta$ if and only if $|\alpha| = |\beta|$ and
$\alpha[i] \preceq \beta[i]$ for every i. A transaction is mispredicting if its spc and npc fields are
not equal, and neither is equal to $\bot$. A transaction is decoded if none of its fields
opcode, rSources, rDest contains $\bot$. A transaction is independent if its mrSt
and rOps fields are maximal (the first is NONE and the second does not contain
$\bot$). A transaction T depends on another transaction U if $\mathrm{rProv}_i(T) = \mathrm{name}(U)$ for some i
or mrSt(T) = name(U). If T is a transaction in a transaction sequence $\alpha$, then
the most recent store of T in $\alpha$ is the last store in $\alpha$ that precedes T. Finally,
if A is a transaction set and T is a transaction, then the store chain of T in A
is the maximal sequence $(S_k, \ldots, S_1)$ with the properties $\mathrm{mrSt}(T) = \mathrm{name}(S_1)$
and $\mathrm{mrSt}(S_i) = \mathrm{name}(S_{i+1})$ for $1 \le i < k$.

Transaction sets have the property that different elements of a set have distinct
names; we use the type TransSet = $(\mathrm{Name} - \{\bot\}) \to \mathrm{Trans}_\bot$ to represent
such sets. For A and B in TransSet, we denote by $A \cup B$ the union of A and
B with A having the higher priority; that is, if A and B both have a transaction
named x, then the transaction named x of $A \cup B$ is that of A. (This union
operation is associative, but not commutative.) The notation $A \preceq B$ means by
definition that $A(x) \preceq B(x)$ for every $x \in \mathrm{Name}$. Note that there is a canonical
map TransSeq $\to$ TransSet, so every transaction sequence can be regarded as
a transaction set.
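
The prioritized union is easy to experiment with. The following is a minimal sketch, assuming that a transaction set is a Python dictionary from names to transactions and that a missing name stands for the undefined transaction; both the encoding and the function names are ours.

```python
def union(a, b):
    """Left-biased union of transaction sets represented as dicts by name."""
    return {**b, **a}

def leq_sets(a, b, leq):
    """A is below B when A(x) is below B(x) for every name x.

    A name missing from a dict is read as the undefined (smallest) value,
    so only the names present in a need to be inspected.
    """
    return all(x in b and leq(t, b[x]) for x, t in a.items())
```

With this encoding one can check on small examples that the operation is associative and that swapping the arguments may change the result.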

For $rf \in \mathrm{Reg} \to \mathrm{Value}$, $mem \in \mathrm{Addr} \to \mathrm{Value}$, $v \in \mathrm{Value}$, $r \in \mathrm{Reg}$ and
$a \in \mathrm{Addr}$, the values of the updated register files and memories are denoted
by $rf[r \mapsto v]$ and $mem[a \mapsto v]$. Note the role of $\bot$ in updating functions: if
$rf' = rf[\bot \mapsto v]$, then $rf'(r) = \bot$ for every r, but if $rf' = rf[r \mapsto \bot]$ then
$rf'(s) = rf(s)$ for every $s \ne r$. Updating of a register file and memory by a
transaction is defined by

$$
rf \bullet T = \begin{cases}
rf[\mathrm{rDest}(T) \mapsto \mathrm{rRes}(T)] & \text{if } \mathrm{rDest}(T) \in \mathrm{Reg} \\
rf & \text{if } \mathrm{rDest}(T) = ()
\end{cases}
$$

$$
mem \bullet T = \begin{cases}
mem[\mathrm{mDest}(T) \mapsto \mathrm{mRes}(T)] & \text{if } \mathrm{mDest}(T) \in \mathrm{Addr} \\
mem & \text{if } \mathrm{mDest}(T) = ()
\end{cases}
$$

The results $rf \bullet \tau$ and $mem \bullet \tau$ of updating rf and mem by a finite transaction
sequence $\tau$ are then defined in a straightforward manner.
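
The role of the undefined value in updates can be made concrete. The following is a minimal sketch, assuming that register files and memories are finite Python dictionaries, that None stands for the undefined value, and that updating by a sequence means updating by its transactions one after another in order (our reading of "straightforward").

```python
BOTTOM = None   # undefined value

def update(f, key, value):
    """f[key -> value] for a finite map f given as a dict."""
    if key is BOTTOM:                 # unknown target: every entry becomes unknown
        return {k: BOTTOM for k in f}
    return {**f, key: value}

def apply_transaction(rf, mem, t):
    """Update register file and memory by one transaction; () means no target."""
    if t["rDest"] != ():
        rf = update(rf, t["rDest"], t["rRes"])
    if t["mDest"] != ():
        mem = update(mem, t["mDest"], t["mRes"])
    return rf, mem

def apply_sequence(rf, mem, seq):
    """Update by a finite sequence, one transaction after another, in order."""
    for t in seq:
        rf, mem = apply_transaction(rf, mem, t)
    return rf, mem
```

The functions return new dictionaries and leave their arguments unchanged, which mirrors the functional style of the definitions.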

- Fetching  Unit - 

Let  pc-rpc  =  pc  if  rpc  =  ();  otherwise  pc-rpc  =  rpc. 

Fetch-Init.  The  initial  values  of  pc  and  fetched  are  pcinit  and  ()  respectively. 

Fetch-Inv  1.  instr(T)  =  pgm(addr(T)),  for  every  transaction  T  occurring  in 
fetched. 

Fetch-Inv 2 (Speculation). If fetched = $(T_1, \ldots, T_k)$, then $\mathrm{addr}(T_1)$ = pc-rpc,
and $\mathrm{addr}(T_{i+1}) = \mathrm{spc}(T_i)$ for every $i \in \{1, \ldots, k - 1\}$.

Fetch-Inv 3 (Next PC). pc' = spc(T) if T is the last transaction of fetched,
and pc' = pc-rpc if fetched = ().

Fetch-Inv 4 (Empty fields). A field of a transaction in fetched has a value
different from $\bot$ if and only if that field is instr, spc or addr.

Fetch-Liv. The formula rpc $\ne$ () $\lor$ fetched $\ne$ () is true infinitely often.
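
The two invariants about program counters can be phrased as executable checks on a single cycle. The following is a minimal sketch, assuming that fetched is a Python list of dictionaries with addr and spc fields and that () denotes an absent request; the function names are ours.

```python
def pc_or_rpc(pc, rpc):
    """The value written pc-rpc in the text: rpc overrides pc unless it is ()."""
    return pc if rpc == () else rpc

def fetch_inv_2(fetched, pc, rpc):
    """Speculation: addresses follow the chain of speculated program counters."""
    if not fetched:
        return True
    starts_right = fetched[0]["addr"] == pc_or_rpc(pc, rpc)
    chained = all(b["addr"] == a["spc"] for a, b in zip(fetched, fetched[1:]))
    return starts_right and chained

def fetch_inv_3(fetched, pc, rpc, next_pc):
    """Next PC: continue after the last fetched transaction, else hold pc-rpc."""
    expected = fetched[-1]["spc"] if fetched else pc_or_rpc(pc, rpc)
    return next_pc == expected
```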

- Ordering  Unit - 

Denote  queue  =  mature  ^  young. 

Ord-Init. The initial values of xpc, rf, queue, flush, prepared and rpc are
$pc_{init}$, $rf_{init}$, (), FALSE, () and () respectively.

Ord-Inv  1  (Naming).  All  transactions  in  queue  have  distinct  names. 

Ord-Inv  2  (Queue).  Let  mature*  be  the  sequence  obtained  from  mature  by 
replacing  every  transaction  in  it  with  an  equally  named  transaction  of  computed, 
if  it  exists.  If  flush  =  TRUE  then  queue'  =  prepared  =  ()  and  dequeued  is  a 
prefix  of  mature*.  If  flush  =  FALSE,  then  there  exists  a  prefix  enqueued  of 
fetched  such  that 


young ^ enqueued $\preceq$ prepared ^ young',
mature* ^ prepared = dequeued ^ mature'.

Ord-Inv 3 (Enqueueing). If T is the first transaction of enqueued, then addr(T) =
xpc. Finally, if queue = () and xpc = addr(T), where T is the first transaction
of fetched, then enqueued $\ne$ ().

Ord-Inv  4  (Preparation).  Let  T  be  a  transaction  in  prepared.  Then 
1. T is decoded, rRes(T) = $\bot$, and T $\in$ mature'.

2. $(\mathrm{rOp}_i(T), \mathrm{rProv}_i(T)) = (\bot, \mathrm{name}(U))$ if U is the ith register provider of T in
queue', and $(\mathrm{rOp}_i(T), \mathrm{rProv}_i(T)) = (rf'(\mathrm{rSource}_i(T)), \mathrm{NONE})$ if T does not
have the ith register provider in queue'.

3. mrSt(T) = name(S) if S is the most recent store for T in queue', and
mrSt(T) = NONE if this most recent store does not exist. The value of
mOp(T) is $\bot$ or (), depending on whether T is a load or not.

Ord-Inv 5 (Dequeueing). All transactions of dequeued are complete and none
of them, except possibly the last one, is mispredicting.

Ord-Inv 6 (Register File). rf' = rf $\bullet$ dequeued.

Ord-Inv 7 (Flush). flush = TRUE if and only if the last transaction in dequeued
is mispredicting.

Ord-Inv 8 (Enabling a memory write). writemem = TRUE if and only if
the first transaction of queue' is an incomplete store.

Ord-Inv  9  (Requested  PC). 

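$$
\mathrm{rpc} = \begin{cases}
\mathrm{npc}(D) & \text{if flush} = \mathrm{TRUE} \\
\mathrm{addr}(E) & \text{if flush} = \mathrm{FALSE} \text{ and } |\mathrm{enqueued}| < |\mathrm{fetched}| \\
() & \text{otherwise,}
\end{cases}
$$

where D is the last transaction of dequeued and E = fetched(|enqueued| + 1).

Ord-Inv 10 (Expected PC).

$$
\mathrm{xpc}' = \begin{cases}
\mathrm{rpc} & \text{if rpc} \ne () \\
\mathrm{spc}(T) & \text{if rpc} = () \text{ and enqueued} \ne () \\
\mathrm{xpc} & \text{otherwise,}
\end{cases}
$$

where T is the last transaction of enqueued.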
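
The two case distinctions above translate directly into code. The following is a minimal sketch, assuming that the first one defines the requested program counter rpc and the second one the next value of the expected program counter xpc (this reading of the left-hand sides is an assumption), and that sequences are Python lists of dictionaries with () for an absent request.

```python
def requested_pc(flush, dequeued, enqueued, fetched):
    """rpc as in Ord-Inv 9; () means that nothing is requested."""
    if flush:
        return dequeued[-1]["npc"]             # redirect after a misprediction
    if len(enqueued) < len(fetched):
        return fetched[len(enqueued)]["addr"]  # first transaction not enqueued
    return ()

def expected_pc(xpc, rpc, enqueued):
    """Next value of xpc as in Ord-Inv 10."""
    if rpc != ():
        return rpc
    if enqueued:
        return enqueued[-1]["spc"]
    return xpc
```

The 1-based index |enqueued| + 1 of the text becomes the 0-based index len(enqueued) in the list.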

Ord-Liv. If the first transaction of queue is complete, then eventually dequeued $\ne$
(). If mature = () and young $\ne$ (), then eventually prepared $\ne$ ().


- Execution  Unit - 

Exec-Init. The initial value of mem is $mem_{init}$, and $\emptyset$ is the initial value of
executing, lingering and computed.

Exec-Inv 1 (Flushing). If flush = TRUE, then executing' = lingering' =
$\emptyset$ and mem' = mem.

Exec-Inv  2  (Contents).  The  sets  executing  and  lingering  are  disjoint.  If 
flush  =  FALSE  then 

executing $\cup$ prepared $\cup$ lingering $\preceq$ executing' $\cup$ lingering'.  (1)

If T is an element of the left-hand side of (1) and T' is the corresponding
element of the right-hand side, we will say that T' is the descendant of T. Note
that the only transactions of executing $\cup$ lingering without a descendant are
members of lingering whose name occurs in a transaction of prepared.

Exec-Inv 3 (Lingering). Assume flush = FALSE. Then all transactions in
lingering are complete and no transaction in executing depends on any transactions
of lingering. Also, a transaction belongs to lingering' if and only if
it either belongs to computed, or is a descendant of a transaction in lingering.

If L is a load in executing $\cup$ prepared and $\theta$ is the store chain of L in this
set, then

$$\mathrm{mOp}(L) \preceq (mem \bullet \theta)(\mathrm{mSource}(L)) \qquad \text{(LC)}$$

is a condition that should be satisfied by the execution unit. Note that the value
on the right-hand side is $\bot$ if mDest(S) = $\bot$ for some S in $\theta$. If mDest(S) $\ne \bot$ for
all S in $\theta$, then the value on the right-hand side is either (1) mRes(S), where S is
the last transaction in $\theta$ with mDest(S) = mSource(L), or (2) mem(mSource(L)),
if no such S exists.
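
The right-hand side of (LC) can be computed by a short function. The following is a minimal sketch, assuming that the memory is a Python dictionary, that the store chain is a list of dictionaries ordered from the oldest store to the most recent one, and that None stands for the undefined value; the function names are ours.

```python
BOTTOM = None   # undefined value

def load_bound(mem, chain, address):
    """Value of (mem updated by the store chain) at the given address.

    chain lists stores oldest first. If any store in the chain has an
    unknown destination, the result is unknown.
    """
    if any(s["mDest"] is BOTTOM for s in chain):
        return BOTTOM
    for s in reversed(chain):                 # the last matching store wins
        if s["mDest"] == address:
            return s["mRes"]
    return mem[address]

def satisfies_lc(m_op, mem, chain, address):
    """(LC): mOp(L) is undefined or equal to the bound (flat order)."""
    return m_op is BOTTOM or m_op == load_bound(mem, chain, address)
```

An undefined memory operand always satisfies the condition, which reflects that (LC) only constrains loads that have already received a value.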

Exec-Inv 4 (Load Correctness). If L' is the descendant of a load L which
satisfies the condition (LC), then L' satisfies (LC) too.

Exec-Inv 5 (Forwarding). If T' is the descendant of T, then $(\mathrm{rProv}_i(T'), \mathrm{rOp}_i(T')) =
(\mathrm{rProv}_i(T), \mathrm{rOp}_i(T))$, or $(\mathrm{rProv}_i(T'), \mathrm{rOp}_i(T')) = (\mathrm{NONE}, \mathrm{rRes}(U))$, where U $\in$
executing $\cup$ lingering, $\mathrm{rProv}_i(T) = \mathrm{name}(U)$, and rRes(U) $\ne \bot$.

Exec-Inv 6 (Memory). 1. If mem' $\ne$ mem, then writemem = TRUE and mem' =
mem $\bullet$ S, where S is a complete store in executing.

2. If computed contains a store S, then mem' = mem $\bullet$ S and writemem = TRUE.

Exec-Inv 7 (Most Recent Store). If T' and U' are descendants of T and U,
and if mrSt(T) = name(U), then mrSt(T') = name(U') unless U' $\in$ computed or
T is a load with mOp(T) $\ne \bot$.

Exec-Liv. Let T be an independent transaction in executing. If T is a store,
assume also that writemem = TRUE. Then eventually flush = TRUE or name(T)
occurs among names of transactions in computed.


8  Conclusions 

In  an  attempt  to  bring  the  power  of  verification  closer  to  the  complexity  of  com¬ 
mercial  processors,  we  have  specified  a  general  microarchitectural  design  and 
proved  its  correctness.  Our  axiomatization  can  be  satisfied  by  a  family  of  mi¬ 
croarchitectures;  therefore,  it  retains  a  good  deal  of  flexibility  as  the  structure 
of  the  individual  components  is  developed.  Since  each  component  is  specified 
independent  of  other  components,  the  implementation  and  proof  of  components 
can  be  carried  out  independently.  Furthermore,  our  specifications  and  proof  are 
independent of many considerations that affect performance. For example, we
do  not  need  to  set  the  number  and  latencies  of  subunits  of  our  execution  units, 
the  width  of  instruction-carrying  wires,  the  accuracy  of  branch  prediction  etc. 
Therefore,  many  design  decisions  based  on  simulation  may  be  made  without  ad¬ 
versely  affecting  the  global  correctness  proof.  Note  also  that  the  wires  present  in 
our  top-level  specification  are  just  what  is  necessary  for  interunit  communica¬ 
tion.  The  units  are  free  to  communicate  through  extra  channels;  for  example,  an 
extra  wire  allows  implementation  of  a  branch  target  buffer  within  the  fetching 
unit. 

Most  of  the  advantages  of  our  approach  come  as  a  consequence  of  using  a 
severely minimized axiomatization. This approach is not very common, probably
because coming up with a reasonably complete set of invariants for an algorithm
is generally difficult. Considerable skill is required to extract the axioms,
but  in  a  limited  domain,  such  as  that  of  hardware  design,  it  could  be  feasible.  We 
plan  to  explore  the  axiomatics  for  hardware  components  and  develop  a  library 
of  specifications  and  typical  proofs. 

We  intend  to  construct  various  refinements  of  our  component  specifications 
and  thus  to  show  that  our  axiomatizations  can  be  related  to  specific  microarchi¬ 
tectures.  We  have  already  developed  executable  PentiumPro-like  specifications 
in  Hawk  using  the  same  structure  described  here  (see  [1]);  we  plan  to  prove 
the correctness of these executable models by checking that their three units satisfy
our axioms. Transactions, as we have demonstrated, are a useful microarchitectural
abstraction, but they also come with a substantial overhead that should
be  eliminated  in  lower-level  refinements.  We  plan  to  develop  a  methodology  for 
shrinking  the  interfaces  of  our  top-level  specifications. 

We  expect  that  further  research  will  confirm  that  reasoning  around  the  his¬ 
tory  table  is  a  promising  proof  technique,  applicable  to  pipeline  designs  in  gen¬ 
eral.  Also  left  to  further  research  is  rewriting  our  axiomatics  in  a  more  stringent 
specification  style,  and  mechanization  of  the  proofs. 


Acknowledgments.  For  their  contributions  to  this  research,  we  thank  Mark 
Aagaard,  Borislav  Agapiev,  Robert  Jones,  and  John  O’Leary  of  Intel  Strategic 
CAD  Labs;  Tito  Autrey,  Nancy  Day,  Dick  Kieburtz  and  Thomas  Nordin  of  OGI; 
and  Arvind  of  MIT. 

The  authors  are  supported  by  Intel  Strategic  CAD  Labs  and  Air  Force  Ma¬ 
terial  Command  (F19628-93-C-0069).  John  Matthews  receives  support  from  a 
graduate  research  fellowship  with  the  National  Science  Foundation. 


References 


[1]  Hawk  Web  page:  http://www.cse.ogi.edu/PacSoft/Hawk/. 

[2]  M.  Aagaard,  R.  Jones,  and  C.-J.  Seger.  Combining  theorem  proving  and  trajectory 
evaluation  in  an  industrial  environment.  In  35th  Design  Automation  Conference 
(DAC  '98),  pages  538-541.  Association  for  Computing  Machinery,  1998. 




[3] M. Aagaard and M. Leeser. Reasoning about pipelines with structural hazards. In
Second International Conference on Theorem Provers in Circuit Design, volume
901 of Lecture Notes in Computer Science. Springer-Verlag, 1995.

[4] Arvind and X. Shen. Design and verification of processors using term rewriting
systems. IEEE Micro, 1999. To appear.

[5] S. Berezin, A. Biere, E. Clarke, and Y. Zhu. Combining symbolic model checking
with uninterpreted functions for out-of-order processor verification. In [9], pages
369-386.

[6] J. Burch and D. Dill. Automatic verification of pipelined microprocessor control.
In Computer Aided Verification, volume 818 of Lecture Notes in Computer
Science, pages 68-80. Springer-Verlag, 1994.

[7] B. Cook, J. Launchbury, and J. Matthews. Specifying superscalar microprocessors
with Hawk. In Workshop on Formal Techniques for Hardware and Hardware-like
Systems, Marstrand, Sweden, June 1998.

[8] A. P. Eiriksson. The formal design of 1M-gate ASICs. In [9], pages 49-63.

[9] G. Gopalakrishnan and P. Windley, editors. Formal Methods in Computer-Aided
Design (FMCAD ’98), volume 1522 of Lecture Notes in Computer Science.
Springer-Verlag, 1998.

[10] L. Gwennap. Intel’s P6 uses decoupled superscalar design. Microprocessor Report,
9(2):9-15, 1995.

[11] T. A. Henzinger, S. Qadeer, and S. K. Rajamani. You assume, we guarantee:
Methodology and case studies. In [13], pages 440-451.

[12] R. Hosabettu, M. Srivas, and G. Gopalakrishnan. Decomposing the proof of
correctness of pipelined microprocessors. In [13], pages 122-134.

[13] A. J. Hu and M. Y. Vardi, editors. Computer Aided Verification (CAV ’98),
volume 1427 of Lecture Notes in Computer Science. Springer-Verlag, 1998.

[14] R. B. Jones, J. U. Skakkebaek, and D. L. Dill. Reducing manual abstraction in
formal verification of out-of-order execution. In [9], pages 2-17.

[15] J. Matthews, J. Launchbury, and B. Cook. Specifying microprocessors in Hawk.
In 1998 International Conference on Computer Languages, pages 90-101. IEEE
Computer Society, 1998.

[16] K. McMillan. Verification of an implementation of Tomasulo’s algorithm by
compositional model checking. In [13], pages 110-121.

[17] J. Moore, T. Lynch, and M. Kaufmann. A mechanically checked proof of the
correctness of the kernel of the AMD K86. IEEE Transactions on Computers,
47(9):913-926, 1998.

[18] A. Pnueli and T. Arons. Verification of data-insensitive circuits: An in-order-retirement
study. In [9], pages 351-368.

[19] J. Sawada and W. Hunt. Processor verification with precise exceptions and
speculative execution. In [13], pages 135-146.

[20] B. Shriver and B. Smith. The Anatomy of a High-Performance Microprocessor: A
Systems Perspective. IEEE Computer Society, 1998.




A  Appendix:  Correctness  Proof 


In Sect. 6 we gave a brief and incomplete description of the history table associated
with a run of our processor model. A precise definition is given below
in Subsect. A.3. In particular, we prove that the columns of the history table
stabilize (Lemma 12), so that the sequence $\tau_\infty$ of limit values is defined. Recall
that the sequence $\tau_\infty^D$ is obtained by removing from $\tau_\infty$ all transactions whose
corresponding status is not dequeued. We prove that this sequence is equal to
the concatenation of all sequences of transactions dequeued by our processor in
the run being considered (Lemma 16). Thus the correctness of the processor can
indeed be expressed as in the Theorem stated in Sect. 6. We repeat it here:

Theorem. $\tau_\infty^D$ is an execution sequence.

We also repeat the four Propositions which, in view of Definition 3, imply
the theorem.

Proposition 1. The sequence $\tau_\infty^D$ is infinite.

Proposition 2. If U and T are two consecutive elements of $\tau_\infty^D$, then npc(U) =
addr(T). Also, the value of the addr field of the first transaction of $\tau_\infty^D$ is $pc_{init}$.

Proposition 3. Let T be a transaction in $\tau_\infty^D$. If U is the rth register provider
of T in $\tau_\infty^D$, then $\mathrm{rOp}_r(T) = \mathrm{rRes}(U)$, and if T does not have an rth provider in
$\tau_\infty^D$, then $\mathrm{rOp}_r(T) = rf_{init}(\mathrm{rSource}_r(T))$.

Proposition 4. Let T be a transaction in $\tau_\infty^D$. If U is the store provider of T
in $\tau_\infty^D$, then mOp(T) = mRes(U), and if T does not have a store provider in $\tau_\infty^D$,
then $\mathrm{mOp}(T) = mem_{init}(\mathrm{mSource}(T))$.

The proofs of the propositions are given in Subsections A.5-A.8. The definition
and some basic properties of the history table are given in Subsection A.3.
The  first  two  subsections  contain  notational  preliminaries  and  key  lemmas  about 
the  relationships  among  the  processor’s  components. 


A.1 Terminology

Regular and singular cycles. For a given run of the processor, the value
of any state variable v at the cycle n ($n \ge 1$) will be denoted by $v^n$. Define
n to be regular or singular depending on whether flush$^n$ is FALSE or TRUE.
Note that n is singular if and only if dequeued$^n$ is non-empty and the last
transaction in it is mispredicting (Ord-Inv 7). Note also that if n is singular,
then queue$^{n+1}$, executing$^{n+1}$ and lingering$^{n+1}$ are empty, by Ord-Inv 2 and
Exec-Inv 1 respectively. As a consequence, two consecutive numbers
cannot both be singular.




Locations. Let us use the term location for the four wires (fetched, prepared, computed, dequeued) and the four state elements (young, mature, executing, lingering) that serve as transaction holders in our processor's specification. In addition to these, we will also consider a few more defined "locations", some of which have previously been defined or just mentioned. First we have $\mathtt{queue}^n = \mathtt{mature}^n \mathbin{\#} \mathtt{young}^n$ and $\mathtt{contents}^n = \mathtt{executing}^n + \mathtt{lingering}^n$, the full contents of the ordering and the execution units respectively. Then we have $\mathtt{enqueued}^n$, a prefix of $\mathtt{fetched}^{n-1}$, defined when n is regular and with properties given in Ord-Inv 2 and Ord-Inv 3. We define $\mathtt{enqueued}^n = ()$ when n is singular. Furthermore, we define $\mathtt{ignored}^n$ by $\mathtt{fetched}^{n-1} = \mathtt{enqueued}^n \mathbin{\#} \mathtt{ignored}^n$ when n is regular, and $\mathtt{ignored}^n = ()$ when n is singular. The set $\mathtt{ripe}^n$ is the transaction set consisting of complete transactions in $\mathtt{mature}^n$. Finally, when n is regular, we define $\mathtt{squashed}^n = ()$, and when n is singular, we define $\mathtt{squashed}^n$ to be the suffix of $\mathtt{queue}^{n-1} \mathbin{\#} \mathtt{fetched}^{n-1}$ of length complementary to $|\mathtt{dequeued}^n|$.

Note that the nine location names are used to name the nodes of the transaction diagram $\Gamma$ in Fig. 3. If X and Y are two nodes of $\Gamma$ we will write $X \le Y$ if X = Y or there exists a sequence of arcs in $\Gamma$ leading from X to Y. There are no non-trivial cycles in $\Gamma$, so this is a partial order relation.


Ancestors and ordinals. A simple fundamental observation is that any transaction present in the processor at any cycle in any of the eight basic locations except fetched has a uniquely determined immediate ancestor among transactions present in the processor at the previous cycle. Note, however, that it is not realistic to assume that this relationship is "one-to-one". For example, in the model we are considering, each transaction in the $\mathtt{prepared}^n$ wire has a copy of itself saved in $\mathtt{mature}^n$ and each transaction in $\mathtt{executing}^n$ or $\mathtt{computed}^n$ also has a copy of its ancestor waiting in $\mathtt{mature}^n$. Choosing a unique "descendant" of a fetched instruction in all subsequent cycles is tantamount to the definition of the history table; see A.3.

Since the initial value $X^1$ is empty for every $X \ne \mathtt{fetched}$, it follows that starting with any transaction T belonging to a location $X^n$ one can define a sequence of transactions in which each is the immediate ancestor of the previous one and which terminates at a transaction $T_0$ belonging to $\mathtt{fetched}^k$ for some $k \le n$. This $T_0$ is a uniquely defined progenitor of T. The ordinal of T is defined to be the ordinal of $T_0$ in the sequence $\mathtt{all\text{-}fetched} = \mathtt{fetched}^1 \mathbin{\#} \mathtt{fetched}^2 \mathbin{\#} \cdots$ of all fetched transactions.

It remains to give a precise definition of immediate ancestors. So suppose X is a basic location, $X \ne \mathtt{fetched}$, and $T \in X^n$. We define the ancestor T' of T and its location $Y^{n-1}$. Consider first the possibilities executing, lingering and computed for X. If $n - 1$ is regular, then T' and Y are found from the inequality

$$\mathtt{executing}^{n-1} \cup \mathtt{prepared}^{n-1} \cup \mathtt{lingering}^{n-1} \preceq \mathtt{contents}^n \qquad (2)$$

of Exec-Inv 2. If $n - 1$ is singular, then $\mathtt{executing}^n$, $\mathtt{lingering}^n$ and $\mathtt{computed}^n$ are empty, so there is nothing to define. Turning to the possibilities young, mature, prepared and dequeued for X, we obtain the corresponding T' and Y easily from the relations

$$\mathtt{young}^{n-1} \mathbin{\#} \mathtt{enqueued}^n \preceq \mathtt{prepared}^n \mathbin{\#} \mathtt{young}^n, \qquad (3)$$

$$\mathtt{mature}^* \mathbin{\#} \mathtt{prepared}^n = \mathtt{dequeued}^n \mathbin{\#} \mathtt{mature}^n \qquad (4)$$

of Ord-Inv 2, provided that n is regular. And if n is singular, then $\mathtt{prepared}^n$, $\mathtt{mature}^n$ and $\mathtt{young}^n$ are empty (Ord-Inv 2) so there is nothing to do for them, while for $\mathtt{dequeued}^n$ we have that it is a prefix of a sequence $\mathtt{mature}^*$, where each member of $\mathtt{mature}^*$ belongs to either $\mathtt{mature}^{n-1}$ or $\mathtt{computed}^{n-1}$.

Note that in all cases we have $T' \preceq T$.

A.2  Between  processor  units 

From the informal specification of the ordering unit (Sect. 4) we expect that transactions in $\mathtt{mature}^n$ should fall into four well-defined classes: for each T in $\mathtt{mature}^n$, T is either complete and waiting for its turn to be dequeued, or there is a unique transaction associated (by name) with T in $\mathtt{prepared}^n$, $\mathtt{executing}^n$, or $\mathtt{computed}^n$. Lemma 2 below confirms this basic relationship between the contents of the ordering and the execution units. Lemmas 3 and 4 state two important relationships between what comes in and what goes out. They refer to the execution unit and the ordering unit respectively, but neither can be derived from the axiomatics of a single unit.

First we need to extend our notation about transaction sets. Transaction sets are disjoint if their domains are disjoint as sets; we will write $A + B$ for $A \cup B$ in the case when we know A and B are disjoint. Define $A \setminus B$ to be the restriction of A on the set difference of the domains of A and B. Define A to be a subset of B if $A(x) = B(x)$ whenever $A(x) \ne ()$. We will write $A - B$ for $A \setminus B$ when we know that B is a subset of A.
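
The operations on transaction sets can be sketched in Python as follows. This is only an illustration under an assumed representation: a transaction set is modelled as a dictionary from transaction names to transactions, and a name that is absent from the dictionary stands for the empty value (). All function names are hypothetical.

```python
def disjoint(a, b):
    """Transaction sets are disjoint if their domains are disjoint."""
    return not (a.keys() & b.keys())


def plus(a, b):
    """A + B: the union of A and B, used only when A and B are disjoint."""
    if not disjoint(a, b):
        raise ValueError("A + B requires disjoint transaction sets")
    return {**a, **b}


def restrict_diff(a, b):
    """A \\ B: restriction of A to the names in dom(A) but not in dom(B)."""
    return {name: t for name, t in a.items() if name not in b}


def is_subset(a, b):
    """A is a subset of B if A(x) = B(x) wherever A(x) is defined."""
    return all(name in b and b[name] == t for name, t in a.items())


# Hypothetical transaction sets; the strings stand for transactions.
a = {"t1": "A1", "t2": "A2"}
b = {"t2": "A2", "t3": "B3"}
```

With these definitions, `restrict_diff(a, b)` and `b` are always disjoint, so `plus(restrict_diff(a, b), b)` is well defined.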

Lemma 1. If n and $n - 1$ are regular, then

$$\mathtt{executing}^{n-1} \cup \mathtt{prepared}^{n-1} \preceq \mathtt{executing}^n + \mathtt{computed}^n.$$

Proof. Since n is regular, Exec-Inv 2 implies

$$(\mathtt{executing}^{n-1} \cup \mathtt{prepared}^{n-1}) + (\mathtt{lingering}^{n-1} \setminus \mathtt{prepared}^{n-1}) \preceq \mathtt{executing}^n + \mathtt{lingering}^n.$$

Since $n - 1$ is regular, Exec-Inv 3 implies

$$\mathtt{lingering}^n = \mathtt{computed}^n + (\mathtt{lingering}^{n-1} \setminus \mathtt{prepared}^{n-1}).$$

The lemma immediately follows from these relations. □

Lemma 2. For every regular n, the sets $\mathtt{ripe}^n$, $\mathtt{computed}^n$, $\mathtt{executing}^n$ and $\mathtt{prepared}^n$ are disjoint, and

$$\mathtt{mature}^n \preceq \mathtt{ripe}^n + \mathtt{computed}^n + \mathtt{executing}^n + \mathtt{prepared}^n. \qquad (5)$$

Moreover, the corresponding elements on the two sides have the same ordinal.

Proof. The proof is by induction. Since the initial values of all the sets involved are empty, the initial case is true. The induction step splits into two cases, depending on whether $n - 1$ is regular or not.

Assume first $n - 1$ is not regular. By Ord-Inv 2, we have $\mathtt{mature}^{n-1} = \mathtt{young}^{n-1} = ()$ and then $\mathtt{prepared}^n = \mathtt{dequeued}^n \mathbin{\#} \mathtt{mature}^n$. This implies $\mathtt{mature}^n = \mathtt{prepared}^n$ because all transactions in $\mathtt{dequeued}^n$ are complete and so cannot occur in $\mathtt{prepared}^n$, which (by Ord-Inv 4) contains only incomplete transactions. It remains only to prove that the sets $\mathtt{ripe}^n$, $\mathtt{computed}^n$ and $\mathtt{executing}^n$ are empty. For $\mathtt{ripe}^n$ it is true because all elements of $\mathtt{mature}^n$ are incomplete. The other two are subsets of $\mathtt{contents}^n$, which is empty by Exec-Inv 1.

Assume now that $n - 1$ is regular. By Ord-Inv 2, we have

$$\mathtt{mature}^* + \mathtt{prepared}^n = \mathtt{dequeued}^n + \mathtt{mature}^n, \qquad (6)$$

where $\mathtt{mature}^*$ is obtained by replacing every transaction in $\mathtt{mature}^{n-1}$ with an equally named transaction of $\mathtt{computed}^{n-1}$. By induction hypothesis, all names of $\mathtt{computed}^{n-1}$ occur among names of $\mathtt{mature}^{n-1}$, so we have

$$\mathtt{mature}^* = \mathtt{computed}^{n-1} + (\mathtt{mature}^{n-1} \setminus \mathtt{computed}^{n-1}). \qquad (7)$$

Combining (6) and (7), and the induction hypothesis in the form

$$\mathtt{mature}^{n-1} \setminus \mathtt{computed}^{n-1} \preceq \mathtt{ripe}^{n-1} + \mathtt{executing}^{n-1} + \mathtt{prepared}^{n-1},$$

we obtain

$$\mathtt{dequeued}^n + \mathtt{mature}^n \preceq \mathtt{computed}^{n-1} + \mathtt{ripe}^{n-1} + \mathtt{executing}^{n-1} + \mathtt{prepared}^{n-1} + \mathtt{prepared}^n. \qquad (8)$$

Observe now that $\mathtt{ripe}^{n-1} + \mathtt{computed}^{n-1}$ is the set of complete transactions in $\mathtt{mature}^*$; this follows from (7), the fact that all transactions in $\mathtt{computed}^{n-1}$ are complete, and the induction hypothesis implying that the complete transactions in $\mathtt{mature}^{n-1} \setminus \mathtt{computed}^{n-1}$ are precisely those of $\mathtt{ripe}^{n-1}$. Since no transaction of $\mathtt{prepared}^n$ is complete (Ord-Inv 4) and all transactions of $\mathtt{dequeued}^n$ are complete (Ord-Inv 5), it follows from (6) that the same set of complete transactions of $\mathtt{mature}^*$ can also be written as $\mathtt{dequeued}^n + \mathtt{ripe}^n$. Thus, (8) rewrites into

$$\mathtt{dequeued}^n + \mathtt{mature}^n \preceq \mathtt{dequeued}^n + \mathtt{ripe}^n + \mathtt{executing}^{n-1} + \mathtt{prepared}^{n-1} + \mathtt{prepared}^n,$$

and the desired result follows immediately from Lemma 1.

It remains to go back and check that the ordinals are the same for any two corresponding members of the two sides of any equality and inequality that was used in the proof. This is done by a straightforward inspection. □

Lemma 3. If n and $n - 1$ are regular, then

$$\mathtt{executing}^{n-1} + \mathtt{prepared}^{n-1} \preceq \mathtt{executing}^n + \mathtt{computed}^n.$$

Proof. This is a strengthening of Lemma 1; that $\mathtt{prepared}^{n-1}$ and $\mathtt{executing}^{n-1}$ are disjoint is a part of Lemma 2. □

Lemma 4. If n and $n - 1$ are regular, then

$$\mathtt{computed}^n + \mathtt{ripe}^n = \mathtt{dequeued}^{n+1} + \mathtt{ripe}^{n+1} \qquad (9)$$

and all transactions of this set belong to $\mathtt{lingering}^{n+1}$.

Proof. The equation is proved in the course of proving Lemma 2. As in the proof of Lemma 1, we have

$$\mathtt{lingering}^n = \mathtt{computed}^n + (\mathtt{lingering}^{n-1} \setminus \mathtt{prepared}^{n-1}),$$

so all we need to prove is that $\mathtt{ripe}^n$ is a subset of $\mathtt{lingering}^{n-1} \setminus \mathtt{prepared}^{n-1}$. Arguing by induction, the problem reduces to showing that the sets $\mathtt{ripe}^n$ and $\mathtt{prepared}^{n-1}$ are disjoint. Indeed, by Exec-Inv 2 and Exec-Inv 3, every transaction in $\mathtt{prepared}^{n-1}$ has a descendant in $\mathtt{executing}^n$ or $\mathtt{computed}^n$, and by Lemma 2, these two sets are disjoint from $\mathtt{ripe}^n$. □

A.3 Definition of the history table

The top row of Fig. 3 depicts all possible paths through selected processor locations that a normally completed transaction can have, from fetching through retiring. A transition from X to Y in most cases should be interpreted as "it is possible that a transaction in $X^n$ has a corresponding transaction in $Y^{n+1}$". The diagram also suggests that all transactions in $X^n$ should have a corresponding transaction in some $Y^{n+1}$ for some Y, the target node of an arc coming from X. "Corresponding" here means having the same ordinal, i.e., being related to the same fetched instruction. Our goal is to define the history of execution of any fetched instruction, so we would like to define "transitions" $(T, X) \mapsto (T', Y)$ with (T', Y) uniquely determined by (T, X). When more than one such transition is possible, we select the right one according to the values of "control variables" (flush in our case).

Transaction flow. The subgraphs of $\Gamma$ defined in Figs. 5-7 represent the transaction flow between cycles n and $n + 1$, depending on whether these numbers are regular or singular. The following lemma states this in precise terms.

Lemma 5. Let $n \ge 2$ and

$$\Gamma_n = \begin{cases} \Gamma_{rr} & \text{if both } n - 1 \text{ and } n \text{ are regular} \\ \Gamma_{rs} & \text{if } n \text{ is singular} \\ \Gamma_{sr} & \text{if } n - 1 \text{ is singular} \end{cases}$$

and let

$$In_n = \{(T, X) \mid T \in X^{n-1} \text{ and } X \text{ is the source of an arrow of } \Gamma_n\},$$

$$Out_n = \{(T', Y) \mid T' \in Y^n \text{ and } Y \text{ is the target of an arrow of } \Gamma_n\}.$$

Then the relation "have the same ordinal" defines a bijection $S_n : In_n \to Out_n$. Moreover, if $S_n(T, X) = (T', Y)$, then X and Y are joined by an arrow in $\Gamma_n$, and, in the cases $\Gamma_n = \Gamma_{rr}$ and $\Gamma_n = \Gamma_{sr}$, $T \preceq T'$.

[Fig. 6. $\Gamma_{rs}$]

Proof. We claim that if $n + 1$ is regular, then

$$\mathtt{young}^n \mathbin{\#} \mathtt{fetched}^n \preceq \mathtt{prepared}^{n+1} \mathbin{\#} \mathtt{young}^{n+1} \mathbin{\#} \mathtt{ignored}^{n+1}, \qquad (10)$$

and if both n and $n + 1$ are regular, then

$$\mathtt{prepared}^n + \mathtt{executing}^n \preceq \mathtt{executing}^{n+1} + \mathtt{computed}^{n+1}, \qquad (11)$$

$$\mathtt{computed}^n + \mathtt{ripe}^n = \mathtt{dequeued}^{n+1} + \mathtt{ripe}^{n+1}. \qquad (12)$$

Indeed, (10) follows from (3) and $\mathtt{fetched}^n = \mathtt{enqueued}^{n+1} \mathbin{\#} \mathtt{ignored}^{n+1}$, and (11) and (12) follow from Lemma 3 and Lemma 4 respectively. The case of the lemma when $\Gamma_n = \Gamma_{rr}$ immediately follows from these relations. Since $\mathtt{young}^n = ()$ when n is singular, the case $\Gamma_n = \Gamma_{rs}$ follows from (10) alone. Finally, in the case when $\Gamma_n = \Gamma_{sr}$ we have that $\mathtt{squashed}^{n+1}$ is the suffix of $\mathtt{mature}^n \mathbin{\#} \mathtt{young}^n \mathbin{\#} \mathtt{fetched}^n$ of length complementary to the length of $\mathtt{dequeued}^{n+1}$ and that $\mathtt{dequeued}^{n+1}$ is a prefix of $\mathtt{mature}^*$, the sequence obtained by updating $\mathtt{mature}^n$ with transactions of $\mathtt{computed}^n$. The lemma now easily follows from Lemma 2. □

History table. Recall the definition of ordinals of transactions. In particular, transactions in the sequence $\mathtt{all\text{-}fetched} = \mathtt{fetched}^1 \mathbin{\#} \mathtt{fetched}^2 \mathbin{\#} \cdots$ have distinct ordinals. For every $i \ge 1$, we define the nascency rank nr(i) to be the number n such that $\mathtt{fetched}^n$ contains a transaction with ordinal i. (If all-fetched is finite, then nr(i) would be defined only for $i \le |\mathtt{all\text{-}fetched}|$, but we will prove that all-fetched is infinite, so nr is defined for every positive integer.)
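
A minimal Python sketch of ordinals and nascency ranks, assuming that a finite prefix of a run is given as a list whose nth entry lists the transactions of the fetched wire at cycle n. The names `nascency_ranks` and `batches` are hypothetical.

```python
def nascency_ranks(fetched_batches):
    """Map each ordinal i (1-based position in all-fetched) to nr(i),
    the cycle whose fetched batch contains the transaction with ordinal i."""
    nr = {}
    ordinal = 0
    for n, batch in enumerate(fetched_batches, start=1):
        for _ in batch:
            ordinal += 1
            nr[ordinal] = n
    return nr


# Hypothetical contents of fetched^1 .. fetched^4.
batches = [["a", "b"], [], ["c"], ["d", "e"]]
nr = nascency_ranks(batches)
```

Because ordinals are assigned in the order of concatenation, the resulting map is non-decreasing in i.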

Definition 4. For a given run of the processor and every n and i such that $n \ge \mathrm{nr}(i)$ define $H^i_n$ inductively as follows:

1. If $n = \mathrm{nr}(i)$, then $H^i_n = (T, \mathtt{fetched})$, where T is the transaction in $\mathtt{fetched}^n$ whose ordinal is i.

2. If $H^i_{n-1} = (T, X)$ and X is a final location, then $H^i_n = H^i_{n-1}$.

3. If $H^i_{n-1} = (T, X)$ and X is not final, then $H^i_n = S_{n-1}(H^i_{n-1})$.

The history table H is the table whose element belonging to the nth row and the ith column is $H^i_n$.
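
The inductive construction of one column of the history table can be sketched in Python. This is an illustration under stated assumptions: the bijection S is passed in as a hypothetical function `step(n, entry)`, the set of final locations is taken to be dequeued, squashed and ignored, and the toy chain of locations below is invented for the example and is not the diagram of Fig. 3.

```python
FINAL = {"dequeued", "squashed", "ignored"}  # assumed final locations


def history_column(first_entry, step, start, end):
    """Build {n: H_n} for n in start..end, where start plays the role of nr(i).

    A final entry is copied unchanged (clause 2); a non-final entry is
    advanced with the supplied transition function (clause 3).
    """
    column = {start: first_entry}
    for n in range(start + 1, end + 1):
        prev = column[n - 1]
        column[n] = prev if prev[1] in FINAL else step(n - 1, prev)
    return column


# Hypothetical transition: move one location along a fixed chain each cycle.
CHAIN = ["fetched", "young", "prepared", "executing", "computed", "ripe", "dequeued"]


def toy_step(n, entry):
    transaction, location = entry
    return (transaction, CHAIN[CHAIN.index(location) + 1])


col = history_column(("T", "fetched"), toy_step, start=2, end=10)
```

In this toy run the column reaches a final location and then stays constant, which is the behaviour that the stabilization lemma of A.4 establishes in general.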

The sequence of elements occurring in the nth row of H will be denoted by $H_n$. The transaction and the status components of $H^i_n$ will be denoted $T^i_n$ and $X^i_n$ respectively. The sequence of transaction components of $H_n$ will be denoted $\tau_n$ and the sequence of the status components of $H_n$ will be denoted $\xi_n$.

Lemma 6. The definition of the history table is correct.

Proof. The only thing that needs to be checked is that if $H^i_n = (T, X)$ and X is not final, then (T, X) belongs to $In_n$, the domain of $S_n$. If X = fetched, then this is obvious. Otherwise, $(T, X) = S_{n-1}(T', X')$, so $(T, X) \in Out_{n-1}$. Thus, X is a target of an arrow in $\Gamma_{n-1}$ and (by inspection of the eight possibilities for $\Gamma_{n-1}$ and $\Gamma_n$) it follows that X is a source of an arrow of $\Gamma_n$, finishing the proof. □


A.4 Basic properties of the history table

Lemma 7. If $H^i_n$ is defined then

1. $X^i_n \le X^i_{n+1}$;

2. $T^i_n \preceq T^i_{n+1}$, provided $X^i_{n+1} \ne \mathtt{squashed}$.

Proof. The proof follows immediately from Definition 4 and Lemma 5. □

Lemma 8. All elements of $Out_n$ occur in $H_n$.

Proof. The proof is obtained by strengthening the last argument in the proof of Lemma 6 by using bijectivity of $S_n$. □

Note that the statuses related to the execution unit (prepared, executing, computed) do not occur in $H_n$ when n is singular, so that the descendancy relation of Sect. 7 is not exactly reflected in the history table. In transitions between regular cycles, however, the descendancy in the execution unit can be seen in the table, as stated in the following lemma, easily derived from definitions.

Lemma 9. If n and $n + 1$ are both regular and $X^i_n$ is prepared or executing, then $T^i_{n+1}$ is the descendant of $T^i_n$ (in the sense of Sect. 7). □

Let Act denote the set consisting of the five nodes of $\Gamma$ that are neither initial nor final. Let $\mathtt{active}^n$ be the sequence obtained from $\tau_n$ by removing all transactions whose corresponding location is not in Act.

Lemma 10. For every n, $\mathtt{queue}^n \preceq \mathtt{active}^n$. Moreover, if n is regular, then $\mathtt{active}^n$ is the sequence obtained by replacing every transaction in $\mathtt{mature}^n$ with an equally named transaction in the set $\mathtt{computed}^n + \mathtt{executing}^n$.

Proof. Suppose first n is singular. Since $\mathtt{queue}^n = ()$, we need to show that $\mathtt{active}^n = ()$ too. Indeed, all elements of $Out_n$ are of the form (T, X), where X is either dequeued or squashed (Fig. 6), and by Definition 4, the status of all elements in the nth row of H is final.

Suppose now n is regular. By Lemma 8, if $A \in Act$ and $T \in A^n$, then (T, A) occurs in $H_n$. Then, by Lemma 2, there exists a bijection $T \mapsto T'$ between elements of $\mathtt{active}^n$ and $\mathtt{queue}^n$ such that $T' \preceq T$. All that remains to prove is that the elements of $\mathtt{queue}^n = \mathtt{mature}^n \mathbin{\#} \mathtt{young}^n$ have increasing ordinals, and that follows easily from the definitions of ancestors and ordinals. □

Now we can derive an often used lemma that guarantees existence of regular number intervals.

Lemma 11. If $X^i_n < \mathtt{dequeued}$ and $\mathrm{nr}(i) < m < n$, then m is regular.

Proof. Since $m > \mathrm{nr}(i)$, we have $X^i_m > \mathtt{fetched}$. Since $m < n$, we have $X^i_m < \mathtt{dequeued}$ (Lemma 7). Thus, $X^i_m \in Act$ and so $\mathtt{active}^m \ne ()$. Since $\mathtt{queue}^m = ()$ when m is singular (Ord-Inv 2), the result follows from Lemma 10. □

The following is a stabilization lemma for columns of the history table.

Lemma 12. For every i, the sequence $H^i_n$ is eventually constant.

Proof. Let $H^i_n = (T_n, X_n)$. By Lemma 7, $X_n \le X_{n+1}$ in $\Gamma$. Since $\Gamma$ is finite and the only cycles in it are loops at nodes, it follows that the sequence $X_n$ stabilizes at $n_0$, say. Let X be its limit value. If X is final, then, again by Definition 4, $H^i_n$ stabilizes as well. The remaining possibility is that X is young, executing or ripe. By Lemma 11, all numbers greater than $n_0$ are regular. By Lemma 5, we then have $T_n \preceq T_{n+1}$ for all $n \ge n_0$. By definition of progress ordering, all strictly increasing chains of transactions are finite, so the sequence $T_n$ is eventually constant. □

The cycle at which the sequence $H^i_n$ assumes its stable value will be denoted sr(i), the stabilization rank. The limit row $H_\infty$ is the sequence of stable values: $H^i_\infty = \lim_n H^i_n$. The sequence of transactions and the sequence of statuses occurring in $H_\infty$ will be denoted $\tau_\infty$ and $\xi_\infty$.

Lemma 13. The sequence $\mathtt{dequeued}^{n+1}$ is a prefix of $\mathtt{active}^n$.

Proof. By Ord-Inv 2, $\mathtt{dequeued}^{n+1}$ is a prefix of $\mathtt{mature}^*$, the sequence obtained by replacing transactions in $\mathtt{mature}^n$ with equally named transactions of $\mathtt{computed}^n$. When n is singular, $\mathtt{dequeued}^{n+1}$ is empty because $\mathtt{mature}^n$ is empty. When n is regular, the claim follows from Lemma 10. □

Let $\tau^+_n$ be the sequence obtained from $\tau_n$ by deleting all members whose corresponding status is ignored. Let also $\tau^{ds}_n$ be the sequence obtained from $\tau_n$ by keeping only its members whose status is dequeued or squashed.

Lemma 14. $\tau^+_n = \tau^{ds}_n \mathbin{\#} \mathtt{active}^n \mathbin{\#} \mathtt{fetched}^n$.

Proof. The proof is by induction. Assume $\tau^+_n$ has the given form. By Definition 4, $\mathtt{fetched}^{n+1}$ is a suffix of $\tau^+_{n+1}$. The prefix $\tau^{ds}_n$ remains intact in $\tau^+_{n+1}$, by the same definition.

By Lemma 13, $\mathtt{dequeued}^{n+1}$ occurs as a prefix in $\mathtt{active}^n$ and so, by Definition 4, it will occur at the corresponding places in $\tau_{n+1}$. Therefore, the sequence $\tau^{ds}_n \mathbin{\#} \mathtt{dequeued}^{n+1}$ is a prefix of $\tau^+_{n+1}$. Now, if $n + 1$ is regular, then, for each $T^i_n$ of $\mathtt{active}^n$ which does not occur in the prefix $\mathtt{dequeued}^{n+1}$, we have $X^i_{n+1} \in Act$ (diagram $\Gamma_{rr}$ or $\Gamma_{sr}$, though in the latter case there are no such elements $T^i_n$). Also, if $T^i_n$ is in $\mathtt{fetched}^n$, then $X^i_{n+1} \in Act$ or $X^i_{n+1} = \mathtt{ignored}$. This finishes the proof if $n + 1$ is regular. If $n + 1$ is singular, then for every $T^i_n$ in $\mathtt{active}^n \mathbin{\#} \mathtt{fetched}^n$ that does not belong to the prefix $\mathtt{dequeued}^{n+1}$, one has $X^i_{n+1} = \mathtt{squashed}$ (diagram $\Gamma_{rs}$). □

As  an  immediate  consequence  of  this  lemma  and  its  proof,  we  obtain  the 
following. 

Lemma  15.  For  every  n  >  1,  r^s  =  {{dequeued72  {{squashed72.  If  n  is 
singular,  then  r®s  =  r* .  □ 

Lemma 16. $\tau^d_n = \mathtt{dequeued}^1 \mathbin{\#} \cdots \mathbin{\#} \mathtt{dequeued}^n$ and $\tau^d_n$ is a prefix of $\tau^d_\infty$.

Proof.  By  induction,  using  Lemma  15.  □ 

Let $\xi^+_n$ be the sequence obtained from $\xi_n$ by deleting all members equal to ignored.

Lemma 17. For every n, the sequence $\xi^+_n$, regarded as a string, belongs to the set defined by the regular expression

{dequeued, squashed}* {executing, computed, ripe}* {prepared}* {young}* {fetched}*.

Moreover, if n is singular, then the regular expression can be restricted to

{dequeued, squashed}* {fetched}*.

Proof. The lemma follows from Lemma 15 and a simple observation that the sequence $\mathtt{prepared}^n \mathbin{\#} \mathtt{ignored}^n$ is a suffix of $\mathtt{queue}^n$ that occurs also as a suffix in $\mathtt{active}^n$ (see Lemma 10). □
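
The shape constraint of Lemma 17 is easy to test mechanically. The following Python sketch is an illustration only: it encodes each status as a single letter (an encoding chosen here, not taken from the paper), drops the ignored entries, and matches the result against the two regular expressions of the lemma. All names are hypothetical.

```python
import re

# One-letter codes for the statuses; "ignored" is deleted before matching.
CODE = {
    "dequeued": "d", "squashed": "s",
    "executing": "e", "computed": "c", "ripe": "r",
    "prepared": "p", "young": "y", "fetched": "f",
}

GENERAL = re.compile(r"[ds]*[ecr]*p*y*f*")  # any cycle n
SINGULAR = re.compile(r"[ds]*f*")           # restricted form for singular n


def encode(statuses):
    """Encode a row of statuses as a string, deleting ignored entries."""
    return "".join(CODE[s] for s in statuses if s != "ignored")


def well_formed(statuses, singular=False):
    """Check a status row against the regular expression of Lemma 17."""
    pattern = SINGULAR if singular else GENERAL
    return pattern.fullmatch(encode(statuses)) is not None
```

For example, a row in which a young entry precedes an executing entry is rejected, because the young block may only follow the executing, computed, ripe and prepared blocks.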

All transactions in $\mathtt{fetched}^n$ have maximal values in their fields instr, addr and spc, and the field name is maximal in transactions of $\mathtt{young}^n$. In $\mathtt{prepared}^n$, all transactions have maximal values in their fields opcode, rSources and rDest. In $\mathtt{computed}^n$ transactions are complete and so have maximal values in all fields. Combining these observations with Lemma 7, we obtain the following lemma, often used without being explicitly mentioned.

Lemma 18. Fix i, let $p, q \ge \mathrm{nr}(i)$ and denote $H^i_p = (T_p, X_p)$, $H^i_q = (T_q, X_q)$.

1. $\mathit{field}(T_p) = \mathit{field}(T_q) \ne \bot$ for any $\mathit{field} \in \{\mathtt{instr}, \mathtt{addr}, \mathtt{spc}\}$.

2. If $p, q > \mathrm{nr}(i)$, then $\mathtt{name}(T_p) = \mathtt{name}(T_q) \ne \bot$.

3. If $\mathtt{prepared} \le X_p, X_q \le \mathtt{dequeued}$, then $\mathit{field}(T_p) = \mathit{field}(T_q) \ne \bot$ for any $\mathit{field} \in \{\mathtt{opcode}, \mathtt{rSources}, \mathtt{rDest}\}$.

4. If $\mathtt{computed} \le X_p, X_q \le \mathtt{dequeued}$, then $T_p = T_q$.

□

Lemma 19. If $i < j$, then $\mathrm{nr}(i) \le \mathrm{nr}(j)$. If $i < j$ and $X^j_\infty \ne \mathtt{ignored}$, then $\mathrm{sr}(i) \le \mathrm{sr}(j)$.

Proof. The first statement is an immediate consequence of the definition of nr. For the second statement, if $X^i_\infty = \mathtt{ignored}$, then $\mathrm{sr}(i) = \mathrm{nr}(i) + 1$ and $\mathrm{sr}(i) \le \mathrm{sr}(j)$ easily follows. The interesting case, when neither of $X^i_\infty, X^j_\infty$ is ignored, follows from Lemma 16. □

Lemma 20. Suppose $i < j$ and $X^i_n, X^j_n \in Act$. Then $X^j_\infty = \mathtt{dequeued}$ implies that $X^i_\infty = \mathtt{dequeued}$.

Proof. Suppose the lemma is not true. By Lemma 17, we must have $X^i_\infty = \mathtt{squashed}$. Consequently, sr(i) is singular. By Lemma 19, $n < \mathrm{sr}(i) \le \mathrm{sr}(j)$. Lemma 11 discards all but the possibility $\mathrm{sr}(i) = \mathrm{sr}(j)$. This, however, contradicts the definition of $\mathtt{squashed}^n$ (dequeued transactions precede transactions squashed at the same cycle). □

A.5 Proof of Proposition 1

Lemma 21. $\Box\Diamond\, \mathtt{fetched} \ne ()$.

Proof. Assume the contrary: $\Diamond\Box\, \mathtt{fetched} = ()$. It follows then from (10) that $\Diamond\Box\, \mathtt{prepared} = ()$. Then (11) implies $\Diamond\Box\, \mathtt{computed} = ()$, and then (12) implies $\Diamond\Box\, \mathtt{dequeued} = ()$. Now from Ord-Inv 9 we deduce $\Diamond\Box\, \mathtt{rpc} = ()$ and reach a contradiction with Fetch-Liv. □

Lemma 22. $\Box\Diamond\, \mathtt{queue} \ne ()$.

Proof  Assume,  on  the  contrary,  that  On  queue  =  ().  Then,  by  Ord-Inv  2, 
On  dequeued  =  ()  and  then,  by  Ord-Inv  7,  On  flush  =  FALSE.  Also  by  Ord- 
Inv  2,  On  enqueued  =  ().  By  Lemma  21,  there  exists  i  such  that  fetched1  ^  (), 
while  queue*1  =  enqueued^  =  dequeued*1  =  ()  and  f  lushfc  =  FALSE  for  all  k  >  i. 
Let  x  —  rpci+1.  By  Ord-Inv  9,  x  ^  (),  while  rpct+1  =  ()  for  all  k  >  i  +  1.  By 
Ord-Inv  10,  xpcfc  —  x  for  all  k  >  i.  Now  let  j  be  the  smallest  number  greater 
than  i  such  that  f  etched^’  ^  ();  it  exists  by  Lemma  21.  We  have  pc-rpc7+1  =  x 
and,  by  repeated  application  of  Fetch-Inv  3,  pc-rpo?  =  x  as  well.  If  T  is  the  first 
transaction  of  fetched* ,  then  addr (T)  =  x  (Fetch-Inv  2),  and  then  Ord-Inv  3 
implies  that  enqueued7-1"1  ^  (),  which  is  a  contradiction.  □ 

Lemma 23. All locations occurring in the entries of $\xi_\infty$ are final.

Proof. Since each of fetched, prepared and computed can occur at most once in any given column of the history table, none of them can occur in $\xi_\infty$. We need to eliminate the possibility of occurrences of young, executing and ripe. Assuming the contrary, let k be the smallest integer such that $X^k_\infty = X$ is one of these three and let $m = \mathrm{sr}(k)$. Then $H^k_\infty = (T, X)$ for some T and, by Lemma 10, $\mathtt{queue}^n$ begins with a transaction $T_n$ such that $T_n \preceq T$, for all $n \ge m$.

Case 1: X = young. Now we have $\mathtt{young}^n \ne ()$ and so $\mathtt{mature}^n = ()$ for all $n > m$. This implies $\mathtt{prepared}^n = ()$ for all $n > m$, directly contradicting Ord-Liv.

Case 2: X = ripe. We have $\mathtt{dequeued}^n = ()$ for all $n > m$. By Lemma 2, $T_m$ equals T and so is complete. This again contradicts Ord-Liv.

Case  3:  X  =  executing.  Note  first  that,  by  Ord-Inv  8,  writemem771  =  TRUE  if 
T  is  a  store;  indeed,  Tn  is  a  store  and  is  not  complete  since  that  would  imply 
X  =  ripe.  Secondly,  by  Lemma  11,  all  numbers  greater  than  m  are  regular. 
Thirdly,  since  executing71  and  computed71  are  disjoint,  name(T)  does  not  occur 
in  computed71  for  any  n  >  m.  These  three  facts,  combined  with  Exec-Liv  imply 
that  T  is  not  independent.  Thus,  we  have  (1)  rOp^T)  =  JL  for  some  i,  or  (2) 
mrSt(T)  /  NONE.  Since  m  =  sr(fc)  and  X ^  =  computed,  it  follows  that  X^l_1  = 
prepared. 

If  (1)  holds,  then  by  Ord-Inv  4,  there  exists  U  in  queue771^1  such  that 
rProvi(T^„1)  =  name(I7).  If  (2)  holds,  then,  again  by  Ord-Inv  4,  there  exists 
U  in  queue71  such  that  mrSt(T^_x)  =  name(Z7).  In  both  cases  we  have  that 
i  depends  on  for  some  j  <  k.  Since  j  <  k  and  all  numbers  greater  than 
n  are  regular,  we  have  X =  dequeued  and  so  there  exists  n  such  that  T is 
in  computed71.  Then  Tj[  also  occurs  in  lingering71  (Exec-Inv  3).  By  Lemma  18, 
name(X^)  =  name(T^l_1),  so  T%  =  T  depends  on  contradicting  the  ax¬ 
iom  that  a  transaction  in  executing71  cannot  depend  on  any  transaction  in 
lingering71  (Exec-Inv  3).  □ 

Proof of Proposition 1. If i is the ordinal of a transaction in $\mathtt{queue}^n$, Lemma 23 implies that $X^i_\infty$ is either dequeued or squashed. It follows then from Lemma 22 that $\xi_\infty$ contains infinitely many entries equal to dequeued or squashed. In other words, the sequence $\tau^{ds}_\infty$ is infinite. By Lemma 15, this sequence is the concatenation of all sequences $\mathtt{dequeued}^n \mathbin{\#} \mathtt{squashed}^n$. Since $\mathtt{dequeued}^n \ne ()$ whenever $\mathtt{squashed}^n \ne ()$ (by definition of squashed and Ord-Inv 7), it follows that $\mathtt{dequeued}^n \ne ()$ for infinitely many values of n, and therefore $\tau^d_\infty$ is infinite.

A.6 Proof of Proposition 2

Let $T^i_\infty$ and $T^j_\infty$ be two consecutive elements of $\tau^d_\infty$. We need to prove that $\mathtt{npc}(T^i_\infty) = \mathtt{addr}(T^j_\infty)$. Let $m = \mathrm{sr}(i)$, $n = \mathrm{sr}(j)$, $m' = \mathrm{nr}(i)$, and $n' = \mathrm{nr}(j)$; by Lemma 19, we have $m \le n$ and $m' \le n'$.

First we show that every p between m and n (if it exists) is regular. Assume the contrary: there exists a singular p such that $m < p < n$. Then $\mathtt{dequeued}^p \ne ()$, by Ord-Inv 5. Thus, there exists l such that $\mathrm{sr}(l) = p$ and $X^l_\infty = \mathtt{dequeued}$. By Lemma 19, it follows that $i < l < j$, contradicting the assumption that $T^i_\infty$ and $T^j_\infty$ are consecutive in $\tau^d_\infty$.

Assume first that $T^i_\infty$ is not mispredicting; the other case will be considered separately. Since now $\mathtt{npc}(T^i_\infty) = \mathtt{spc}(T^i_\infty)$, all we need to show is $\mathtt{spc}(T^i_\infty) = \mathtt{addr}(T^j_\infty)$. If $m' = n'$ then $T^i_{m'}$ and $T^j_{m'}$ are members of $\mathtt{fetched}^{m'}$ and both belong to $\mathtt{enqueued}^{m'+1}$. Using Lemma 20, we deduce that these two transactions must be consecutive in $\mathtt{fetched}^{m'}$. Therefore, $\mathtt{spc}(T^i_{m'}) = \mathtt{addr}(T^j_{m'})$, by Fetch-Inv 2. Now assume $n' > m'$. Then $T^i_{m'+1}$ is the last element of $\mathtt{enqueued}^{m'+1}$, $T^j_{n'+1}$ is the first element of $\mathtt{enqueued}^{n'+1}$, and $\mathtt{enqueued}^r = ()$ for all r between $m' + 1$ and $n' + 1$ (if there are any such r). We claim that all r between $m'$ and $n'$ are regular. We know that all numbers between $m'$ and m are regular (Lemma 11), and that all numbers between m and n are regular (proved above). Thus, the claim fails only if m is singular and $n' > m$. Then $T^i_\infty$ would be the last transaction in $\mathtt{dequeued}^m$ (because $n > m$ now and $\mathtt{dequeued}^r = ()$ for all r between m and n), contradicting the assumption that $T^i_\infty$ is not mispredicting. We can conclude that $\mathtt{xpc}^{m'+1} = \mathtt{spc}(T^i_{m'})$ from Ord-Inv 10, that $\mathtt{xpc}^{m'+1} = \cdots = \mathtt{xpc}^{n'}$ (also from Ord-Inv 10), and that $\mathtt{xpc}^{n'} = \mathtt{addr}(T^j_{n'+1})$ from Ord-Inv 3. This finishes the proof in the case when $T^i_\infty$ is not mispredicting.

Assume finally that $T^i_\infty$ is mispredicting. It follows from Ord-Inv 5 that m is singular and also that $\mathtt{xpc}^{m+1} = \mathtt{npc}(T^i_\infty)$ (Ord-Inv 9 and Ord-Inv 10). It follows also that $n' \ge m$; otherwise $T^j_m$ would exist and would be in $\mathtt{active}^m$, which is absurd because this sequence must be empty since m is singular.

It follows that $T^j_{n'+1}$ is the first element of $\mathtt{enqueued}^{n'+1}$ and that $\mathtt{enqueued}^r = ()$ for every r between m and $n' + 1$. We already know that the numbers between m and $n' + 1$ are all regular, and it follows from the Ord-Inv axioms, similarly as in the previous case, that $\mathtt{xpc}^{m+1} = \cdots = \mathtt{xpc}^{n'+1} = \mathtt{addr}(T^j_{n'+1})$.

We also need to prove that $\mathtt{addr}(T^1_\infty) = \mathtt{pcinit}$. We do have $\mathtt{pc}^1 = \mathtt{pcinit}$ by Fetch-Init. Let $n = \mathrm{nr}(1)$. By Fetch-Inv 2, $\mathtt{addr}(T^1_n) = \mathtt{pc}^{n-1}$. Since $\mathtt{addr}(T^1_\infty) = \mathtt{addr}(T^1_n)$ (Lemma 18), it suffices to check that $\mathtt{pc}^{n-1} = \mathtt{pc}^1$. In view of Fetch-Inv 3, this reduces to proving $\mathtt{rpc}^m = ()$ for $1 \le m \le n - 1$. The last claim is a consequence of Ord-Inv 9 and the simple facts $\mathtt{flush}^m = \mathrm{FALSE}$ and $\mathtt{queue}^m = ()$ for all $m \le n - 2$.


A.7 Proof of Proposition 3

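Lemma 24. For every $reg \in Reg$ and $n \ge 1$,

$$\mathtt{rf}^n(reg) = \begin{cases} \mathtt{rRes}(T) & \text{if } T \text{ is the last element in } \tau^d_n \text{ such that } \mathtt{rDest}(T) = reg \\ \mathtt{rfinit}(reg) & \text{if such } T \text{ does not exist} \end{cases}$$

Proof. The proof follows from Lemma 16 and Ord-Inv 6. □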


Denote by $\tau^{\circ}_n$ the sequence obtained from $\tau_n$ by removing all its elements $T^i_n$ such that $X^i_n$ is ignored or squashed.

Lemma  25.  Let  prepared  <  <  dequeued.  If  m  <  n  and  T is  the  rth 

register  provider  of  in  r°,  then  T £  is  the  rth  provider  of  in  r^.  Also ,  if 
does  not  have  the  rth  register  provider  in  r°,  then  T ^  does  not  have  the  rth 
provider  of  in  r, ^ . 
Proof.  We  prove  only  the  first  assertion  of  the  lemma.  The  proof  of  the  second
is  analogous. 

Since  X ^  <  X{n  and  X3m  <  X £,  neither  of  X^X^  is  ignored  or  squashed, 
so  are  in  r^.  By  Lemma  18,  rSourc er  (T^)  =  rSourcer(T£)  and  rDest(T^)  = 

rDest(T^).  Thus,  rSourcer(T^)  =  rDest(T^). 

Suppose  now  k  is  such  that  j  <  k  <  i,  is  in  r^,  and  rSourcer(T^)  = 
rDest(T * ).  Again,  by  Lemma  18,  we  have  rSourcer(T^)  —  rDest(T^),  so  T*  does 
not  belong  to  r°.  Thus,  X*  is  ignored  or  squashed.  The  only  possibility  for 
X *  =  ignored  would  be  that  m  =  n— 1  and  =  fetched,  but  that  contradicts 
Lemma  17.  If  X*  —  squashed,  then  it  would  follow  that  there  exists  p  such  that 
Tp  belongs  to  squashedp.  This  would  imply  that  Tp  belongs  to  squashed^,  which 
is  not  true.  □ 


Proof  of  Proposition  3.  Suppose  T  and  U  are  transactions  in  such  that  U  is 
the  rth  register  provider  of  T.  Let  i  and  j  be  the  ordinals  of  T  and  U  respectively. 
Denote  Hk  =  (T*,  Xjf)  and  H3k  =  ([/*,  Yk).  Let  n  =  sr (i)  and  let  m  be  the  unique 
integer  such  that  Xm  =  prepared.  From  Lemma  11  we  have  that  every  k  such
that  m  <  k  <  n  is  regular.

By  Lemma  16,  Un  is  the  rth  provider  of  Tn  in  r£.  Then,  by  Lemma  25,  Um 
is  the  rth  provider  of  Tm  in  By  Lemma  17,  Ym  >  prepared.  Note  also  that, 
being  an  element  of  prepared™,  Tm  belongs  to  queue™. 

Case  1:  Ym  =  dequeued.  By  Lemma  17,  Tm  does  not  have  an  rth  register  provider 
in  active™.  It  follows,  using  Lemma  10,  that  Tm  does  not  have  an  rth  provider 
in  queue™.  Let  reg  =  rSourcer(Tm).  By  Ord-Inv  4,  reg  ^  J_  and  rOpr(Tm)  = 
rf™(reg).  Since  Um  is  the  rth  provider  of  Tm  in  r^,  it  follows  that  Um  is  the 
last  transaction  in  that  sequence  with  rDest  field  equal  to  reg.  It  follows  from  Lemma  24
that  rOpr(Tm)  =  rRes(Um)  and  so  rSources(T)  =  rRes(U),  as  required.

Case  2:  Ym  ^  dequeued.  Now  Ym  belongs  to  active™.  Since  Um  is  the  rth 
provider  of  Tm  in  r^,  it  follows  that  Um  is  the  rth  provider  of  Tm  in  active™  as 
well.  From  Lemma  10  we  deduce  that  U ^  is  the  rth  provider  of  Tm  in  queue™, 
where  U'in  <  Um.  It  follows  from  Ord-Inv  4  that  rOpr(Tm)  =  _L  and  rProvr(Tm)  = 
name(f7^J,  which  immediately  implies  rProvr(Tm)  =  name(I/m). 

Since  Tm  ■<  •  •  *  <  Tn,  rOp(Tm)  =  ±  and  rOp(Tn)  ^  ±,  there  exists  a  unique 
number  p  such  that  m  <  p  <  n,  rOp (Tp)  =  _L,  and  rOp(Tp+i)  ^  _L.  Since 
Tp  is  incomplete,  it  belongs  to  executing*3  or  preparedp.  From  Lemma  9  we 
conclude  that  Tp+i  is  the  descendant  of  Tp.  Furthermore,  Exec-Inv  5  implies  that 
there  exists  a  transaction  V  in  executingpU linger ingp  such  that  rProvr(Tp)  = 
nam e(U)  and  rOpr(Tp+i)  =  rRes(F)  ^  _L.  It  follows  that  name(Vr)  =  nam e(Um). 
We  claim  that  V  =  Up,  which  then  implies  rOpr(T)  =  rOpr(Tp+x)  =  rRes(Up)  = 
rRes(D’),  finishing  the  proof. 

Suppose  the  claim  is  not  true.  Then  Up  cannot  belong  to  activep  because 
this  sequence  contains  V  and  cannot  contain  two  transactions  with  the  same 
name.  It  follows  that  X£  =  dequeued,  so  there  exists  q  such  that  p  <  q  <  m 
and  Uq  is  in  computed9  and  so  in  lingering9.  Since  Tq  belongs  to  executing9 
and  rProvr(Tq)  =  name(Uq),  it  follows  that  Tq  depends  on  Uq.  This  contradicts
Exec-Inv  3,  finishing  the  proof  of  the  claim. 




Fig.  8.  Transactions  involved  in  the  resolution  of  a  register  dependency  (Case  2  of  the 
proof  of  Proposition  3). 


We  also  need  to  prove  rOpr(T)  =  rfinit(rSourcer(T))  in  the  case  when  T  = 
T does  not  have  the  rth  register  provider  in  Let  again  m  be  the  integer  such 
that  Xm  =  prepared.  By  Lemma  25,  T ^  does  not  have  the  rth  provider  in  r^.  As 
in  Case  1  above,  we  obtain  rOpr(T^)  =  rfm(reg),  where  reg  =  rSourcer(T^). 
Using  Lemma  24,  we  deduce  rf 771  (reg)  =  rfin^(reg),  finishing  the  proof. 

A. 8  Proof  of  Proposition  4 

Lemma 26. Every load in prepared^n + executing^n satisfies the condition (LC).

Proof. By Ord-Inv 4, the mOp field of every load in prepared^n is ⊥, so (LC) is
true for such loads. Furthermore, every load in executing^n is a descendant of a
load in prepared^(n-1) ∪ executing^(n-1), so by induction on n and using Exec-Inv 4,
it follows that these loads also satisfy (LC). □

Lemma  27.  If  computed”  contains  a  store,  then  this  store  is  the  first  transac¬ 
tion  in  active”. 

Proof.  Suppose  the  lemma  is  not  true  and  pick  the  minimal  n  that  provides  a
counter-example.  Suppose  T £  is  a  store  in  computed”  and  T jj  is  the  first  trans¬ 
action  in  active”,  and  j  <  i.  Pick  i  so  that  i  —  j  is  smallest. 

Let  U  be  the  first  transaction  of  queue”.  By  Lemma  10  U  <  T^.  By
Exec-Inv  6,  writemem”  =  true  and  by  Ord-Inv  8,  U  is  an  incomplete  store. 
Using  Lemma  17,  we  conclude  that  T ^  belongs  to  executing”. 

Let  m  be  such  that  T \  belongs  to  prepared7”.  By  Lemma  19,  T ^  belongs 
to  active7”  and  so,  by  Ord-Inv  4,  mrSt(Z^)  =  name(T^)  for  some  k  such  that 
j  <  k  <  i.  Since  mrSt(X£)  ^  mrSt(T^),  it  follows  from  Exec-Inv  7  that  for  some
p  such  that  m  <  p  <  n  one  has  Tjf  in  computed^ 

Since  T 'l  is  in  executing71,  7JJ  is  not  in  computedp,  so  k  ^  j.  Thus,  T£  is  a 
store  in  computed*7  and  is  not  a  first  transaction  in  activep.  By  minimality  of 
n,  we  have  p  —  n  and  then  a  contradiction  with  the  minimality  assumption  on 
i.  □ 

Corollary  28.  If  active71  contains  a  complete  store ,  then  it  is  its  first  transac¬ 
tion.  If  dequeued71  contains  a  store ,  then  it  contains  only  one  and  it  is  its  first 
transaction.  □ 

Proof.  The  first  statement  is  an  immediate  consequence  of  Lemma  28.  For  the 
second,  use  also  Lemma  13  □ 

Corollary  29.  If  memn+1  ^  memn  then  the  first  transaction  S  in  active71  is  a 
complete  store  in  executing71  and  mem774-1  =  mem71  •  S. 

Proof.  The  proof  follows  directly  from  Exec-Inv  6  and  Corollary  28.  □ 

Lemma  30.  If  computed71  H-  ripe71  contains  a  store  S,  then  mem71  =  memn  •  S. 

Proof.  Suppose  S  =  T*.  Then,  for  some  m  <  n,  =  S  is  in  computed771,  and 
so,  by  Exec-Inv  6,  mem771  =  mem771-1  •  S.  Therefore,  mem771  •  S  =  mem777.  For  every 
p  such  that  m  <  p  <  n,  T*  is  in  ripep  and  is  the  first  transaction  in  active77. 
It  follows  from  Corollary  29,  that  memp  =  mem777  for  all  such  p.  In  particular, 
mem77  =  mem777  and  the  lemma  follows.  □ 

Lemma  31.  For  every  n,  mem77  =  mem or  mem77  =  mem^  •  r®  •  S,  where 
S  is  a  store  and  is  a  first  transaction  of  active77. 

Proof.  We  argue  by  induction.  The  initial  case  is  clearly  true.  For  the  induction 
step,  suppose  first  that  mem77  =  memin^  •  r£.  If  dequeued774-1  is  non-empty  and 
contains  no  store,  then  mem774-1  =  mem77  by  Lemma  29,  and  mem inu  •  r^+1  = 
meirijnif  •  ♦  dequeued77-1"1  =  mem inu  •  r^+1  is  clear.  If  dequeued774-1  contains  a 

store  5,  then  by  Lemma  28,  dequeued77-1-1  begins  with  S  and  contains  no  other 
stores.  Being  an  element  of  dequeued77-1-1,  S  belongs  to  computed77  or  ripe77 
(Eq.  12),  so  by  Lemma  29,  memn-fl  =  mem77.  On  the  other  hand,  Lemma  30 
implies  mem77  =  mem77  •  5  and  so  mem77  =  mem77  •  dequeued77-*-1  =  mem;™*  •  r^+1. 
Finally,  if  dequeued774-1  is  empty,  then  r^+1  =  and  both  mem774-1  =  mem77 
and  mem774-1  ^  mem77  are  possible.  The  desired  result  in  the  first  case  follows 
immediately,  and  in  the  second  case  it  follows  from  Lemma  29. 

Assume  now  the  second  possibility  for  the  inductive  hypothesis:  mem77  = 
roem;nj*  -5,  where  S  is  a  store  and  is  a  first  transaction  of  active77.  Lemma  29 
implies  mem774-1  =  mem77.  If  dequeued774-1  is  empty,  the  result  immediately  follows. 
If  dequeued77-1-1  is  non-empty,  then  it  begins  with  S  and  contains  no  other  stores, 
so  mem774-1  =  mem77  =  mem init  •  r£  ■  S  =  mem init  *  •  dequeued774-1  =  memini*  • 
Lemma 32. If T is a load or store in prepared^n + executing^n and U is
the most recent store of T in active^n, then either (1) U is in prepared^n +
executing^n and mrSt(T) = name(U), or (2) U is in computed^n + ripe^n (and
therefore is the first transaction in active^n) and mrSt(T) = NONE.

Proof  Let  Tj^  —  T  and  T £  =  U.  Let  m  be  such  that  T ^  is  in  prepared771.  Then 
T £  is  in  active771  and  so  mrSt(T^)  =  name(T^),  where  is  the  most  recent 
store  for  T ^  in  active771.  We  claim  that  k  =  j.  Otherwise,  using  Exec-Inv  7  we 
would  obtain  T£  in  computed77  for  some  p  <  n,  contradicting  Corollary  28.  The 
lemma  now  follows  from  Exec-Inv  7  and  Corollary  28.  □ 

Corollary  33.  Let  T  be  a  load  or  store  in  prepared71  4  executing71  and  let  <p 
be  the  store  chain  of  T  in  this  set.  Then  <p  is  equal  to  the  sequence  of  stores  in 
prepared71  +  executing71  that  precede  T  in  active71.  □ 

Lemma  34.  Let  L  be  a  load  in  prepared71  4  executing71  and  let  ip  be  the  pre¬ 
fix  of  active71  consisting  of  transactions  preceding  L.  Then  mOp (L)  X  (mem71  • 
^)(mSource(L)). 

Proof.  By  Lemma  26,  mOp(L)  X  (mem71  *  0)(mSource(L)),  where  <p  is  the  store 
chain  of  L  in  prepared71  +  executing71.  Let  ipo  be  the  sequence  obtained  by 
deleting  from  ip  all  transactions  which  are  not  stores.  Clearly,  mem 71 -ip  —  mem n-ipo* 
By  Corollary  28,  all  transactions  in  ipo  are  in  prepared71  4  computed71,  except 
possibly  the  first  store  (say,  5),  which  may  belong  to  computed71  4  ripe71.  By 
Lemma  33,  we  have  ipo  =  (pin  the  first  case,  and  ipo  —  (S)  #(p  in  the  second.  By 
Lemma  30,  mem71  •  (p  =  mem71  •  ipo,  finishing  the  proof.  □ 

Lemma 35. Let α and β be transaction sequences such that α ⪯ β. Let mem be
an element of type IAddr → Value and addr an element of type IAddr. Then
(mem • α)(addr) ⪯ (mem • β)(addr).

Proof.  By  direct  examination.  □ 

Proof  of  Proposition  4.  Suppose  L  is  a  load  in  r£,.  Let  6  be  the  prefix  of  r£,
consisting  of  transactions  that  precede  L.  We  will  prove  that  mOp(L)  =  (mem^  • 
0)(mSource(L)).  It  is  easy  to  see  that  this  would  imply  Proposition  4. 

Let  i  and  be  such  that  L  =  T^  and  and  let  n  be  the  largest  number  such 
that  is  in  prepared71  4  executing71.  Thus,  T£+1  is  in  computed714-1  and  it 
follows  from  Exec-Inv  7  that  mOp(T£)  ^  ±.  Thus,  mOp(X£)  =  mOp(L)  and 
mSource(T^)  =  mSource(L). 

From  Lemma  34  we  then  obtain  mOp(L)  -<  (mem71  ■  ip) (mSource(L)),  where  ip 
is  the  prefix  of  active71  consisting  of  transactions  preceding  L.  By  Lemma  31, 
mem71  is  equal  to  either  men ^nu  ■  r®  or  mem ina  -  t®  -S,  where  the  store  S  is  the  first 
transaction  of  active71.  Since  ip  is  a  prefix  of  active71  (and  L  is  not  a  store),  it 
follows  that  mem71  ■  ip  =  mem inu  •  r®  •  ip. 

By  Lemma  20,  all  transactions  of  ip  are  eventually  dequeued.  Thus,  r®  4t= ip 
6.  Using  Lemma  35,  we  finally  obtain  mOp(I/)  ^  (mem inu  -0)(mSource(L)),  which 
must  be  equality  because  mOp(L)  ^  _L. 
Abstract

Based on our experience with modelling and verifying
microarchitectural designs within Haskell, this paper examines
our use of Haskell as a host for an embedded language. In
particular, we highlight our use of Haskell’s lazy lists, type
classes, lazy state monad, and unsafePerformIO. We also
point to several areas where Haskell could be improved.

1  Introduction 

There are many ways to design and implement a language,
and not all of them imply building from the ground up. Landin’s
vision of the next 700 programming languages [18], for example,
was to build domain-specific vocabularies on top of
a generic language substrate. In the verification community,
this is known as a shallow embedding of one language or logic
into another. From our programming language perspective
we believe that, in effect, every abstract type defines a language.
Admittedly, most abstract types by themselves make
poor languages, but when interesting combinators are provided
the language suddenly becomes rich and vibrant in
its own right. This explains the continuing popularity of
combinator libraries, from the time of Landin until now.

The animation language/library Fran is a beautiful example
[10, 9]. Fran provides two families of abstract types
in Haskell: behaviors and events. To construct a term of
type Behavior Int, for example, is to write a sentence in
the Fran language, using Fran primitives and Fran combinators.
To build complex Fran entities, however, the full
power of Haskell can be brought to bear. Fran objects are
just another abstract data type.

How good is Haskell at hosting other languages? This is
one of those questions that can only be answered through
experience, and this is precisely where we can contribute. In
this paper we describe our use of Haskell as a host to a
microarchitectural modelling language, calling attention to
the aspects of Haskell that helped us, those that hindered us,
and the features we wish we had. In particular, we highlight
our use of Haskell’s lazy lists, type classes [15], the lazy state
monad [19], and unsafePerformIO [17]. This paper contains
no deep theory, but rather a dose of measured introspection.

The remainder of this paper is organized as follows: In
Section 2 we provide the motivation for our work in microarchitectural
modelling. In Section 3 we introduce Hawk and
show how we use lazy lists to model wires. In Sections 4, 5,
and 6, we show how type classes, the lazy state monad, and
unsafePerformIO, respectively, are put to use in Hawk. In
Section 7 we describe an application that makes use of all
four features. In the final sections we outline where Haskell
has constrained us, and discuss future work.

2  Building a microarchitectural description language

Contemporary  superscalar  microarchitectures  employ 
tremendously  aggressive  strategies  to  mitigate  dependencies 
and  memory  latency.  Their  complexity  taxes  current  design 
techniques  to  the  limit.  The  trend  continues,  as  the  size  of 
design  teams  grows  exponentially  with  each  new  generation 
of  chip. 

To gain an appreciation for the complexity of modern
microarchitectures, take as an example the model of an instruction
reorder buffer (ROB) which occurs frequently in out-of-order
microprocessors like the Pentium III. The function of
the ROB is to maintain a pool of instructions, and to determine
dynamically which of them are eligible for delivery to
an execution unit once their operands have been computed.
This way, instructions are executed at the earliest possible
moment. Furthermore, instructions are introduced speculatively,
based upon numerous successive branch predictions.
Consequently, instructions that have previously been
scheduled and executed must sometimes be rescinded when
a branch is discovered to have been mispredicted. Thus the
ROB must keep track of instructions up to the point that
they can either be retired (committed) or flushed.

Since some instructions following a branch may already
have been executed when a branch misprediction is discovered,
register contents are also affected. At a branch misprediction,
register mapping tables must be modified to invalidate
the contents of registers that contain results of rescinded
instructions. The contents of registers that are possibly
live must be preserved until after the branch has been
resolved, thus increasing the complexity of the interaction
between a ROB and the registers.

In  addition,  there  are  all  the  issues  of  managing  on-chip 
resources,  of  ensuring  rapid  and  correct  communication  of 
results,  of  cache  coherence  and  so  on.  It  will  get  worse. 
The  next  generation  of  microarchitectures  will  address  many 
more  issues  such  as  explicit  instruction  parallelism  [13]  and 
multiple  instruction  threads  [29]. 

As if all these algorithms did not provide enough design
complexity, commercially viable microarchitectures are
also subject to legacy requirements. For example, Intel’s
Pentium III must deal with dozens of exception types to
remain compatible with earlier versions of the X86 architecture.
Pentium III also struggles with the variable length
of X86 instructions. It tries to fetch three each cycle, and
it turns out that dynamically determining the length of instructions
before decoding is one of Pentium III’s primary
performance bottlenecks. Again, this type of problem is not
going to go away. Intel’s upcoming Merced processor will
execute not only its new instruction set [8], but X86 as well
[12].

With designs of this complexity, it is hard to imagine that
designers will not stumble upon subtle concurrency bugs.
The need for powerful and effective modelling and verification
has never been greater. By couching microarchitecture
modelling in terms of higher-level abstractions and emphasizing
the modularity of a design it is possible to regain
control of the design space. This is what we have done.
In conjunction with Intel’s Strategic CAD Laboratory, we
have developed Hawk as an executable modelling language
embedded in Haskell. Hawk is very high level compared
with other hardware description languages. Consequently,
even complex microarchitecture models remain remarkably
brief, allowing designers to retain a high level of intellectual
control over the model. For example, the complete formal
model of a speculative, superscalar, out-of-order microarchitecture
based on the Pentium III required fewer than 1000
lines of code [5].

3  Lazy  lists:  adding  signals  to  Haskell 

Effectively, Hawk is an embedding of Lustre-style signals [4]
into Haskell. Signals model values that change over time,
like wires in a microprocessor. Following O’Donnell [24],
Srivas & Bickford [28], and many others, we implement signals
as lazy lists. The idea is very simple: the nth element
of the list represents the value of the wire at clock tick n.
Thus the value of each wire is a complete description of its
behavior over time. This approach leads to circuit semantics
with a definite denotational flavor. In contrast, state
transition systems (another popular style) are much more
operational in their nature. There are naturally advantages
and disadvantages to each.

To represent units with clocked inputs and clocked
outputs we use functions from signals to signals, known
as list transformers (or stream transformers). Combinational
circuits can be turned into clocked circuits
simply by mapping them down their input lists. So
if add :: (Int, Int) -> Int acts like a simple addition circuit,
then map add :: [(Int, Int)] -> [Int] is its clocked
equivalent.

The fundamental non-combinational circuit is the delay.
The delay is what makes feedback loops in clocked circuits
possible; without any delays, a feedback loop would just
generate smoke! A delay is defined so that the (n + 1)st
element of the output is equal to the nth element of its input,
with an initial value output for the very first clock tick.
The implementation of delay :: a -> [a] -> [a] is simply
“cons”.
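
As a minimal sketch of these two ideas on plain lists (before the signal type is made abstract), the following self-contained fragment lifts a combinational adder with map and closes a feedback loop through delay. The names clockedAdd and accumulate are ours, introduced only for illustration; they are not part of Hawk.

```haskell
-- A combinational adder: one pair of inputs, one output.
add :: (Int, Int) -> Int
add (x, y) = x + y

-- Its clocked equivalent: apply the adder at every clock tick.
clockedAdd :: [(Int, Int)] -> [Int]
clockedAdd = map add

-- The delay circuit: emit an initial value, then the input one tick late.
delay :: a -> [a] -> [a]
delay = (:)

-- Hypothetical example: a running sum. The output is fed back through a
-- delay, so each tick adds the current input to the previous output.
accumulate :: [Int] -> [Int]
accumulate xs = out
  where
    out  = clockedAdd (zip xs prev)
    prev = delay 0 out
```

Because delay produces its first element without inspecting its input, the recursive definition of out is productive; removing the delay would make the loop diverge.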

Some care is needed within this paradigm, however.
Arbitrary use of list processing functions, especially those
which discard elements, such as filter, can cause problems
in that they may require infinite buffers to implement. To
restrict the way in which a signal can be constructed or altered,
we make the signal type abstract in Hawk and provide
a basic set of manipulation functions that are known to be
safe.


newtype Signal a

delay :: a -> Signal a -> Signal a
lift0 :: a -> Signal a
lift1 :: (a -> b) -> Signal a -> Signal b


lift0 returns a constant signal, and lift1 is just map.
Later we will use the derived operator bundle, which takes
a pair of signals, and produces a signal of pairs. Restricting
access to the implementation in this way gives the usual
freedoms to provide alternative implementations, or even to
refine the semantics somewhat. For example, we could implement
signals as functions from the natural numbers to
values.

If  the  above  signature  seems  to  be  missing  something 
—  it  is.  The  rest  comes  from  Haskell  itself,  in  particular, 
lazy  recursive  definitions.  You  could  say  that  the  missing 
operator  of  the  abstract  type  is  a  (lazy)  fixpoint  operator. 
Consider  a  resettable  counter  circuit  like: 

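[Figure: resettable counter circuit. A mux controlled by reset selects
between the constant 0 and next; next is the output passed through
lift1 (+1) and then delay 0.]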

which,  in  Hawk,  we  might  model  as: 

counter reset = out
  where
    next = delay 0 (lift1 (+1) out)
    out  = mux reset (lift0 0) next

Note  the  mutual  recursion  between  signals.  The  laziness 
of  Haskell  is  vital  for  this  definition  to  have  the  intended 
meaning. 
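
The counter can be run as written once the signal interface is given a concrete representation. The sketch below assumes the lazy-list representation described above; the definition of mux and the top-level observer simulate are our own guesses at reasonable implementations, not Hawk's actual code.

```haskell
-- Signals as lazy lists, hidden behind a newtype.
newtype Signal a = Signal [a]

delay :: a -> Signal a -> Signal a
delay x (Signal xs) = Signal (x : xs)

lift0 :: a -> Signal a
lift0 x = Signal (repeat x)

lift1 :: (a -> b) -> Signal a -> Signal b
lift1 f (Signal xs) = Signal (map f xs)

-- Assumed multiplexer: pick the second signal when the control is True.
mux :: Signal Bool -> Signal a -> Signal a -> Signal a
mux (Signal cs) (Signal ts) (Signal es) =
  Signal (zipWith3 (\c t e -> if c then t else e) cs ts es)

counter :: Signal Bool -> Signal Int
counter reset = out
  where
    next = delay 0 (lift1 (+1) out)
    out  = mux reset (lift0 0) next

-- Top-level observation only: view the first n ticks as a list.
simulate :: Int -> Signal a -> [a]
simulate n (Signal xs) = take n xs
```

For instance, simulate 6 (counter (Signal [False, False, False, True, False, False])) yields [0,1,2,0,1,2]. The sketch relies on Signal being a newtype: pattern matching on a newtype constructor forces nothing, so delay can emit its first element before out is known.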

One omission that is not an oversight is any way to observe a
signal by taking its head or tail. This is intentional. A circuit that
was specified to take the tail of a list would be asking for an
infinite buffer. We do allow signals to be viewed as lists for
the purpose of viewing simulation results, but this operation
is only provided for use at the top-level.

4  Organizing microarchitectural abstractions with type classes

The point of Hawk has been to build abstractions that increase
the concision of microarchitectural models [5], and
facilitate the verification process [22].

In order for microarchitectural abstractions to be relevant,
they must be extraordinarily flexible in the types that
they operate over. Instruction sets differ in a variety of details:
size and type of data, number and types of registers,
and the instructions themselves. Internally, machines may
use other instruction sets. For example, the AMD K6 [27]
implements the X86 instruction set, but uses a RISC instruction
set within its execution core.

We  use  type  classes  to  facilitate  the  description  of  circuits 
that  operate  over  all  instruction  sets.  For  example,  the  type 
of  an  ALU  might  be: 

alu  ::  (Instruction  i,  Bits  w)  =>  (i,w,w)  ->  w 
This way alu can be used in an X86 model (where w is set
to 32-bit words and i to X86 instructions) or a 64-bit RISC
instruction set, like that of the Alpha. The Bits class is an
extension of Haskell’s Num class that adds operators related
to word size, signedness, etc. The Instruction class captures
the common elements between different instruction
sets.
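
To illustrate how such a signature lets one circuit serve several word sizes, here is a small sketch. The methods of Instruction (isAdd, isSub) and the toy instruction type ToyOp are hypothetical, and Haskell's Num stands in for Hawk's Bits class, whose actual operators we do not reproduce here.

```haskell
import Data.Int (Int32, Int64)

-- Hypothetical stand-in for Hawk's Instruction class.
class Instruction i where
  isAdd :: i -> Bool
  isSub :: i -> Bool

-- A toy instruction set, used only for this example.
data ToyOp = Add | Sub | Nop
  deriving (Eq, Show)

instance Instruction ToyOp where
  isAdd = (== Add)
  isSub = (== Sub)

-- Num is used here in place of Bits; the ALU is generic in both the
-- instruction type i and the word type w.
alu :: (Instruction i, Num w) => (i, w, w) -> w
alu (i, x, y)
  | isAdd i   = x + y
  | isSub i   = x - y
  | otherwise = 0
```

The same alu can then be applied at 32-bit words, as in alu (Add, 3 :: Int32, 4), or at 64-bit words, as in alu (Sub, 10 :: Int64, 4), without any change to its definition.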

With  common  architectural  characteristics  captured 
with  type  classes,  we  are  then  able  to  build  abstractions 
that  help  organize  microarchitectural  models.  For  example, 
transactions  [1,  23]  are  a  simple  yet  powerful  grouping  of 
control  and  data.  A  transaction  is  a  machine  instruction 
grouped  together  with  its  state.  This  state  might  include: 

•  Operand  values. 

•  A  flag  indicating  that  the  instruction  has  caused  an 
exception. 

•  A  predicted  jump  target,  if  the  instruction  is  a  branch. 

Microarchitecture models that utilize transactions can then
make decisions locally rather than with a separate control
unit.

Hawk provides a library of functions for creating and
modifying transactions. For example, bypass takes two
transactions and builds a new transaction where the values
from the destination operands of the first transaction
are forwarded to the source operands of the second. If i is
the transaction:

(r4,8) <- (r2,4) + (r1,4)

and j is the transaction:

r10 <- (r4,6) + (r1,4)

then bypass i j produces the transaction:

r10 <- (r4,8) + (r1,4)

That is, bypass inserted i’s more recent valuation of r4 into
the source operand of j.

By parameterizing over the instances of finite words and
registers:

bypass :: (Bits w, Register r) =>
          Trans i r w -> Trans i r w -> Trans i r w

bypass can be used in many contexts. Within our Pentium
III-like microarchitectural model we use bypass on both instructions
with real register references and virtual register
references (both are instances of the type class Register).
In our Merced-like model [6], we use the same bypass with
IA-64 instructions.
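
The forwarding behaviour of bypass can be sketched with a deliberately simplified transaction type. The record layout below (Operand, Trans, and their field names) is an assumption made for illustration, and an Eq constraint replaces the Bits and Register classes; Hawk's real definitions are richer.

```haskell
-- An operand is a register name with an optional known value.
data Operand r w = Operand { regOf :: r, valOf :: Maybe w }
  deriving (Eq, Show)

-- A simplified transaction: an opcode with destination and source operands.
data Trans i r w = Trans
  { opcode :: i
  , dests  :: [Operand r w]
  , srcs   :: [Operand r w]
  } deriving (Eq, Show)

-- Forward values computed by the first transaction into matching source
-- operands of the second; all other fields are left unchanged.
bypass :: Eq r => Trans i r w -> Trans i r w -> Trans i r w
bypass producer consumer = consumer { srcs = map forward (srcs consumer) }
  where
    forward s =
      case [v | Operand r (Just v) <- dests producer, r == regOf s] of
        (v : _) -> s { valOf = Just v }
        []      -> s
```

With i and j built to mirror the example above (i writes 8 to r4, j reads a stale 6 from r4), bypass i j returns j with the r4 source operand updated to 8 and the r1 operand untouched.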

5  Lazy  state:  using  state-based  components 

There  has  been  debate  in  the  Haskell  community  about  the 
merits  of  strictness  within  the  state  monad.  In  this  section 
we  describe  an  application  where  a  lazy  state  monad  is  the 
right  thing. 

Some microarchitectural components, such as register
files, are more naturally (and efficiently) presented as state
transition systems than list transformers. Fortunately, we
can easily embed state-based models into the list transformer
idiom using the lazy state monad and runST [19].

Imagine  modelling  a  register  file  as  an  array  which,  on 
each  clock  tick,  is  both  written  to  and  read  from. 


reg :: Register r => Signal (r,w) -> Signal r ->
       Signal w
reg writes reads
  = runST (
      do { reg <- newArray (minAddr, maxAddr) init
         ; loopST (regFile reg) (bundle writes reads)
         }
    )

regFile :: STArray s Addr Val -> ((Addr,Val), Addr)
        -> ST s Val
regFile reg ((a,w),r)
  = do { writeArray reg a w
       ; readArray reg r
       }

where loopST is a monadic map on signals:

loopST :: (a -> ST s b) -> Signal a
       -> ST s (Signal b)

The semantics of lazy state is as follows. The monadic
structure sequentializes the operations of the monad but
forces nothing. As the result of the state thread is demanded,
so execution proceeds, but in the order determined
by the monadic sequentialization. Thus execution proceeds
on demand, but some of that demand is transmitted through
the state sequencer.

The state within the scope of runST is completely hidden
from the outside world. Thus as far as the rest of the
program is concerned, reg is completely pure, as indicated
by its type. The encapsulation of the state occurs because
of the type of runST. Inside the implementation of regFile,
however, the situation is quite different. The array writes
are “imperative”, having effects immediately visible to subsequent
reads.

In  the  use  of  loopST  above,  the  state  machine  is  executed 
step  by  step,  consuming  its  list  input  and  generating  its  list 
output  on  the  way.  In  particular,  the  loop  construct  did 
not  attempt  to  execute  the  state  machine  completely  before 
releasing  the  output  list.  It  is  this  behavior  we  require  of  the 
state  monad  and,  fortunately,  though  not  officially  a  part  of 
Haskell,  most  implementations  provide  it. 
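
A runnable approximation of this register file can be written with the lazy ST monad that ships in GHC's base library (Control.Monad.ST.Lazy). In the sketch below, signals are plain lists, the array is replaced by an STRef holding an association list, and unwritten registers read as 0; all three simplifications are ours.

```haskell
import Control.Monad.ST.Lazy (ST, runST)
import Data.STRef.Lazy (STRef, newSTRef, readSTRef, modifySTRef)
import Data.Maybe (fromMaybe)

type Addr = Int
type Val  = Int

-- Monadic map over a (possibly infinite) list of inputs.
loopST :: (a -> ST s b) -> [a] -> ST s [b]
loopST _ []       = return []
loopST f (x : xs) = do
  y  <- f x
  ys <- loopST f xs
  return (y : ys)

-- One clock tick: perform the write, then the read.
regFile :: STRef s [(Addr, Val)] -> ((Addr, Val), Addr) -> ST s Val
regFile ref ((a, w), r) = do
  modifySTRef ref ((a, w) :)           -- newest binding shadows older ones
  contents <- readSTRef ref
  return (fromMaybe 0 (lookup r contents))

-- A pure list transformer built from a stateful component.
reg :: [(Addr, Val)] -> [Addr] -> [Val]
reg writes reads = runST $ do
  ref <- newSTRef []
  loopST (regFile ref) (zip writes reads)
```

Because the monad is lazy, take 3 (reg (zip (cycle [1, 2]) [1 ..]) (repeat 1)) terminates with [1,1,3] even though both inputs are infinite. With the strict Control.Monad.ST the same expression would not terminate, which is the behaviour the text above warns against.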

6  Monitoring circuits with unsafePerformIO

When embedding a language, one often needs “language
primitives” that provide good things in bad ways. Fran, for
example, has a function:

importBitmap :: Filename -> Bitmap

which imports a bitmap file in the IO monad but uses
unsafePerformIO to treat the bitmap as a pure value.

When using Hawk we find that one often wants to observe
the values flowing across a signal. Unfortunately,
Haskell’s semantic purity makes this viewing rather difficult.
Often, without re-coding a model, it is not possible to
observe the signal. Therefore we provide the function:

probe :: Filename -> Signal a -> Signal a

As far as Hawk-level models are concerned, a probe is simply
an identity. However, the external world receives a different
view. Probes are fundamentally side-effecting, writing
values to a file, even though they apparently have a pure
type. Thus probes cannot be defined within Haskell proper.
Instead, they required some Haskell system hacking through
the use of unsafePerformIO.
probe name vals = zipWith (write name) [1..] vals

write name clock val = unsafePerformIO $
  do { h <- openFile name AppendMode
     ; hPutStrLn h (show clock ++ " " ++ outp val)
     ; hClose h
     ; return val
     }

Notice that we are careful not to change the strictness of
lazy lists.

We have found that unsafePerformIO is a powerful facility
for building domain-specific tools that observe, but
do not affect the microarchitectural models.
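
A self-contained variant of the probe on plain lists might look as follows. It is only a sketch: it uses Show in place of Hawk's outp, appendFile in place of explicit handle management, and a NOINLINE pragma as a precaution against the optimizer duplicating the side effect; none of these choices is claimed to match the Hawk sources.

```haskell
import System.IO.Unsafe (unsafePerformIO)

-- Behaves as the identity on the list, but logs each element, tagged with
-- its clock tick, at the moment that element is demanded.
probe :: Show a => FilePath -> [a] -> [a]
probe name vals = zipWith (write name) [1 :: Integer ..] vals

{-# NOINLINE write #-}
write :: Show a => FilePath -> Integer -> a -> a
write name clock val = unsafePerformIO $ do
  appendFile name (show clock ++ " " ++ show val ++ "\n")
  return val
```

Since zipWith is lazy in the list elements, the probe writes a line only when the corresponding value is actually evaluated, so inserting a probe does not change which parts of a model are forced. One consequence is that the log reflects evaluation order, which need not match clock order if a consumer demands later elements first.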

7  Verification  in  Hawk 

The past several sections have, one by one, demonstrated
the usefulness of lazy lists, type classes, the state monad,
and unsafePerformIO. In this section we discuss a particularly
exciting application that requires all four features.

Hawk  provides  tools  that  can  be  used  to  formally  ver¬ 
ify  properties  of  models.  Suppose  that  we  want  to  prove 
the  following  properties  about  the  resettable  counter  from 
Section  3: 

1.  when  the  reset  line  is  low  on  the  next  clock  cycle,  the 
output  is  the  value  at  the  current  cycle  plus  1, 

2.  and  when  the  reset  line  is  high  at  the  current  clock 
cycle,  the  output  is  zero. 

In Hawk, we might express these properties as follows.
Assume that r0 and r1 are the values of the reset line at
time t and t + 1 respectively, and that n and m are the
corresponding outputs.

prop_counter = prop_one && prop_two
  where
    prop_one = not r1 ==> (n + 1 === m)
    prop_two = r0 ==> (n === 0)

The trick is to show that these properties hold for arbitrary
values of r0 and r1. To do that, we will use symbolic values
for r0 and r1, and symbolically simulate the circuit.

The  approach  we  take  to  symbolic  simulation  [7]  is 
straightforward.  Take  a  sufficiently  polymorphic  function, 
and  instantiate  it  at  a  symbolic  datatype.  What  we  mean 
by  a  symbolic  datatype  is  any  datatype  that  is  enriched 
with  variables  and  additional  term  structure.  For  example, 
we  have  used  the  following  datatype  for  symbolic  simulation 
of  simple  arithmetic  circuits. 

data Symbo a
  = Const a
  | Var String
  | Plus (Symbo a) (Symbo a)
  | Times (Symbo a) (Symbo a)

The  catch  is  that  some  care  is  required  in  making  func¬ 
tions  “sufficiently”  polymorphic.  This  means  that  over  the 
parts  of  the  program  that  you  wish  to  symbolically  evaluate, 
you  cannot  use  concrete  types,  because  those  types  must  be 
able  to  become  symbolic. 
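
To make the idea concrete, here is a small sketch of instantiating a polymorphic function at the symbolic datatype. The Num instance and the function mac are illustrative assumptions, not part of the Hawk library.

```haskell
data Symbo a
  = Const a
  | Var String
  | Plus (Symbo a) (Symbo a)
  | Times (Symbo a) (Symbo a)
  deriving (Eq, Show)

-- Arithmetic on symbolic values builds term structure instead of computing.
instance Num a => Num (Symbo a) where
  (+) = Plus
  (*) = Times
  fromInteger = Const . fromInteger
  negate x = Times (Const (-1)) x
  abs = error "abs: not modelled in this sketch"
  signum = error "signum: not modelled in this sketch"

-- A "sufficiently polymorphic" multiply-accumulate (hypothetical example):
-- no concrete numeric type appears in its signature.
mac :: Num a => a -> a -> a -> a
mac x y acc = x * y + acc
```

At type Int, mac 2 3 4 evaluates to 10. At type Symbo Int, mac (Var "x") 3 (Var "acc") yields the term Plus (Times (Var "x") (Const 3)) (Var "acc"). Had mac been given the type Int -> Int -> Int -> Int, the second use would be rejected.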


7.1  Fitting  symbolic  simulation  into  Haskell 

In  places,  such  as  with  the  Num  class,  Haskell’s  prelude  is 
remarkably  amenable  to  symbolic  simulation.  In  others  it 
is  not.  As  an  example,  consider  Booleans.  To  capture  the 
operations  of  both  concrete  and  symbolic  Booleans  we  have 
defined  a  class  Boolean,  which  makes  all  the  boolean  oper¬ 
ators  from  the  prelude  abstract: 

class Boolean b where
  true  :: b
  false :: b
  (&&)  :: b -> b -> b
  (||)  :: b -> b -> b
  (==>) :: b -> b -> b
  not   :: b -> b

We  have  also  defined  the  class  Eql,  which  is  like  the 
standard  Eq  class,  except  that  it  is  also  abstracted  over  the 
result  type  for  equality,  resulting  in  a  multi-parameter  type 
class: 

class Eql a b where
  (===) :: a -> a -> b

Conditional  expressions,  too,  must  be  abstract: 

class Mux c a where
  mux :: c -> a -> a -> a

If  the  condition  on  which  we  branch  is  symbolic,  then  it  is 
clear  that  the  result  must  be  symbolic  as  well.  Hence  there 
is  a  relationship  between  the  type  of  the  conditional,  and 
the  type  of  the  result — just  the  sort  of  thing  that  multi¬ 
parameter  type  classes  express  well. 

To capture the common usage of conditional expressions,
we make Bool an instance of Mux:

instance Mux Bool a where
  mux x y z = if x then y else z

We  can  now  employ  many  implementations  of  Booleans. 
In  particular  we  can  use  binary  decision  diagrams  (BDDs) 
[3],  which  implement  semantic  equality  between  symbolic 
boolean expressions in constant time. Using H/Direct [11],
the state monad and unsafePerformIO, we have imported
the CMU BDD package into Haskell. In the style of the
modelling  language  of  Voss  [26],  Hawk  treats  BDDs  just  like 
Booleans.  But,  thanks  to  type  classes,  a  user  can  also  choose 
not  to  use  BDDs  —  so  long  as  their  choice  is  an  instance  of 
Boolean. 
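
The classes above can be exercised without a BDD package. The sketch below substitutes a naive expression tree, BExpr, for BDDs; BExpr and its Mux instance are assumptions made for illustration, and unlike BDDs they give no canonical form.

```haskell
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}

import Prelude hiding (not, (&&), (||))
import qualified Prelude as P

class Boolean b where
  true, false :: b
  (&&), (||), (==>) :: b -> b -> b
  not :: b -> b

infixr 3 &&
infixr 2 ||
infixr 1 ==>

-- Concrete Booleans.
instance Boolean Bool where
  true = True
  false = False
  (&&) = (P.&&)
  (||) = (P.||)
  a ==> b = P.not a P.|| b
  not = P.not

-- Hypothetical symbolic Booleans: plain syntax trees standing in for BDDs.
data BExpr
  = BLit Bool
  | BVar String
  | BAnd BExpr BExpr
  | BOr BExpr BExpr
  | BNot BExpr
  deriving (Eq, Show)

instance Boolean BExpr where
  true = BLit True
  false = BLit False
  (&&) = BAnd
  (||) = BOr
  a ==> b = BOr (BNot a) b
  not = BNot

class Mux c a where
  mux :: c -> a -> a -> a

-- A concrete condition can select between values of any type.
instance Mux Bool a where
  mux x y z = if x then y else z

-- A symbolic condition forces a symbolic result.
instance Mux BExpr BExpr where
  mux c t e = (c && t) || (not c && e)
```

With these instances, mux True 'a' 'b' is 'a', while mux (BVar "c") (BVar "t") (BVar "e") builds a term that keeps both branches.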

7.2  Proving  a  property 

We  now  have  the  infrastructure  to  verify  our  properties.  Our 
strategy  is  to  simulate  the  counter  with  symbolic  values  on 
the  reset  line  for  the  first  two  ticks,  and  then  test  the  desired 
property  on  the  first  two  outputs.  We  have  made  the  initial 
value  of  the  delay  in  the  counter  an  additional  parameter  so 
that  we  can  place  a  symbolic  value  there  as  well.  This  makes 
our  test  independent  of  the  internal  state  of  the  counter,  and 
thus  makes  it  valid  to  test  the  properties  only  at  the  first 
two  clock  ticks. 

test :: BDD
test = prop_one && prop_two
  where
    a  = var "a"  :: BDD_Vector8
    r0 = var "r0" :: BDD
    r1 = var "r1" :: BDD

    reset :: Signal BDD
    reset = r0 `delay` r1 `delay` false

    [n, m] = counter a reset @@@ [0, 1]

    prop_one = not r1 ==> (n + 1 === m)
    prop_two = r0 ==> (n === 0)

(@@@ is an operator for sampling a signal at the specified
times.) By evaluating test we are proving that, for Boolean
vectors of length 8, the counter circuit meets our specification.
Using types more general than BDD_Vector8, we can
prove the properties for counters of arbitrary size.
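
For readers without a BDD package to hand, the same experiment can be approximated by brute force. The sketch below swaps symbolic simulation for exhaustive enumeration over concrete Booleans and a small range of initial values; the counter definition is our reading of the two properties rather than the definition from Section 3, and signals are assumed to be lazy lists.

```haskell
type Signal a = [a]

delay :: a -> Signal a -> Signal a
delay x xs = x : xs

-- Assumed counter: outputs 0 when reset is high, otherwise previous output + 1.
-- The initial value of the internal delay is an explicit parameter.
counter :: Int -> Signal Bool -> Signal Int
counter initial reset = out
  where
    prev = delay initial out
    out = zipWith (\r p -> if r then 0 else p + 1) reset prev

-- Both properties, checked at the first two clock ticks.
propsHold :: Int -> Bool -> Bool -> Bool
propsHold a r0 r1 = propOne && propTwo
  where
    reset = r0 `delay` (r1 `delay` repeat False)
    out = counter a reset
    n = out !! 0
    m = out !! 1
    propOne = r1 || (n + 1 == m)   -- not r1 ==> n + 1 == m
    propTwo = not r0 || (n == 0)   -- r0 ==> n == 0

-- Enumeration stands in for symbolic variables in this sketch.
checkAll :: Bool
checkAll =
  and [ propsHold a r0 r1
      | a <- [-8 .. 8], r0 <- [False, True], r1 <- [False, True] ]
```

Enumeration only covers the values listed, whereas the BDD-based test covers every 8-bit vector at once; the sketch is meant to show the shape of the test, not to replace it.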

8  Where  Haskell  and  Hawk  tangle 

For  our  domain,  Haskell  has  turned  out  to  be  an  excellent 
tool  for  experimenting  with  language  design.  However,  in  a 
few places, Haskell is not a perfect match. In this section
we review our use of lazy lists, type classes, the lazy state
monad, and unsafePerformIO, and point to the hindrances
that we have encountered.

8.1  Lazy  Lists 

In some cases Haskell is a little too generous. Our preferred
semantics for signals is that of truly infinite, or coinductive,
lists, i.e., not that of finite, infinite, and partially defined
lists, as in Haskell. Any feedback loop that does not include
at least one delay should be rejected as being ill-defined.
Haskell, however, will stubbornly do its best to make sense
of even such ill-defined definitions. Could Haskell do better?
We have constructed a shallow embedding of Hawk in
Isabelle [25], which is much less forgiving. In order to have
Isabelle accept our recursive definitions we have had to develop
a richer theory of induction over coinductive datatypes
than previously available [21]. Using this theory, Isabelle is
able to accept all the valid Hawk definitions that we have
thrown at it, while rejecting the invalid ones. It would be
useful if Haskell’s type system could be extended to handle
this, perhaps using unpointed types [20] to express valid
coinductive definitions.
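
The difference between the two kinds of feedback loop is easy to see in a small sketch, again assuming signals are lazy lists (the names ticks and stuck are ours).

```haskell
type Signal a = [a]

delay :: a -> Signal a -> Signal a
delay x xs = x : xs

-- Well-formed: the feedback path passes through a delay, so each element
-- depends only on strictly earlier elements.
ticks :: Signal Int
ticks = delay 0 (map (+ 1) ticks)

-- Ill-defined: a zero-delay loop. Haskell's type checker accepts it, but
-- demanding any element of stuck fails to terminate.
stuck :: Signal Int
stuck = map (+ 1) stuck
```

Here take 5 ticks gives [0, 1, 2, 3, 4], while stuck denotes a partially defined list, which is exactly the kind of value a coinductive reading of signals would exclude.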

8.2  Type  Classes 

Because  the  type  representing  an  instruction  set  must  re¬ 
main  abstract,  we  cannot  directly  pattern  match  on  it.  In¬ 
stead,  the  operations  of  the  Instruction  class  provide  pred¬ 
icates  to  identify  common  instructions  such  as  nops,  arith¬ 
metic  ops,  loads  and  stores  and  jumps. 

class (Show i, Eq i) => Instruction i where
  isNoOp  :: i -> Bool
  isAddOp :: i -> Bool
  isSubOp :: i -> Bool

If  Haskell  allowed  arbitrary  views  of  datatypes  [30],  then 
this  could  be  handled  much  more  nicely. 
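
As an illustration of the predicate style, the sketch below gives an instance for a toy instruction set. ToyInstr and countArith are hypothetical names introduced only for this example.

```haskell
class (Show i, Eq i) => Instruction i where
  isNoOp  :: i -> Bool
  isAddOp :: i -> Bool
  isSubOp :: i -> Bool

-- A toy instruction set (hypothetical); operands are register numbers.
data ToyInstr
  = Nop
  | Add Int Int Int
  | Sub Int Int Int
  | Load Int Int
  deriving (Show, Eq)

instance Instruction ToyInstr where
  isNoOp Nop = True
  isNoOp _ = False
  isAddOp (Add _ _ _) = True
  isAddOp _ = False
  isSubOp (Sub _ _ _) = True
  isSubOp _ = False

-- Generic code sees only the predicates, never the constructors.
countArith :: Instruction i => [i] -> Int
countArith = length . filter (\i -> isAddOp i || isSubOp i)
```

A component written against Instruction i works for any instruction set with an instance, at the cost of one predicate per case that would otherwise be a pattern match.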

8.3  The  State  Monad 

Haskell’s syntactic support for state is not a perfect fit.
First, Haskell has no way to declare storage statically, but
this is exactly what is required. In the register example, the
array is allocated at the beginning, and nothing else is allocated
afterwards. Since silicon cannot be allocated on the
fly, when we come to consider other interpretations of Hawk
models, it would be useful to guarantee that the body of the
state code did not affect the shape of the store, merely its
contents.

Secondly,  in  our  microarchitectural  models,  the  pattern 
loopST  f  (bundle  xs  ys)  occurs  often  enough  to  want  a 
language  construct  to  describe  it.  Putting  these  ideas  to¬ 
gether,  we  may  ideally  wish  to  write  something  like: 

reg writes reads
  = runST (do { array reg (minAddr, maxAddr) = init
              ; loop (w<-writes, r<-reads)
                  { writeArray reg a w
                  ; readArray reg r
                  }
              }
          )

8.4 Using unsafePerformIO

Probes  often  work  quite  well,  but  there  are  some  glitches. 
While  we  have  been  careful  to  preserve  the  semantics  of 
Haskell  in  introducing  probes,  the  semantics  of  probes  are 
not  really  preserved  by  Haskell.  Due  to  lazy  evaluation, 
there’s  nothing  to  assure  that  probe  output  will  appear  in 
the order expected. The output of a probe at clock tick 9
might be put in the file before the output of a probe at clock
tick 7. Another glitch is that, in a model, we are free to
use a given unit more than once. But if that unit has an
embedded probe, you will get the output of both probes in
the file. This is not problematic, except that you have no
way of identifying which output is from which probe.

But these problems have less to do with the perhaps unscrupulous
nature of using unsafePerformIO, and more to
do with a shortcoming in our overall design. In the section
on future work, we will discuss an approach that will
mitigate these problems.

8.5  Symbolic  simulation 

Our  drive  to  make  the  entire  Hawk  library  sufficiently 
polymorphic  to  perform  symbolic  evaluation  has  made  us 
painfully  aware  of  the  shortcomings  of  Haskell’s  type  class 
system  in  describing  abstract  data  types.  Haskell’s  module 
system  can  be  used  in  a  limited  way  to  effect  abstraction, 
as  we  have  used  for  the  signal  type.  But  Haskell’s  module 
system  is  only  intended  as  name  space  management,  and  is 
a  poor  match  when  you  intend  to  use  abstract  types  instan¬ 
tiated  at  many  different  types. 

The  type  class  system  at  times  works  brilliantly.  And 
what  is  most  impressive  is  how  well  it  has  worked  for  us, 
as  we  use  it  for  tasks  far  beyond  its  original  intended  use 
(simply  as  a  system  of  overloading).  However,  the  fit  is  not 
always  perfect.  One  place  is  the  lack  of  explicit  control  over 
instancing.  One  of  the  neat  aspects  of  symbolic  evaluation 
is  that  it  allows  us  to  take  an  existing  executable  model 
and  verify  properties  of  it,  without  changing  the  model  at 
all.  However,  this  does  not  work  quite  as  well  as  it  could 
because  of  limitations  in  the  class  system.  Ideally,  we  would 
like  to  instantiate  test  above  at  different  symbolic  types. 
However,  there  is  no  good  way  to  parameterize  test  by  the 
types  in  question,  without  resorting  to  unpleasantries  like 
adding  dummy  arguments.  The  type  of  the  counter  data 
is purely an intermediate value in the definition of test.
If  we  were  not  specific  about  the  type  of  a,  Haskell  would 
consider  the  declaration  ambiguous.  Here  we  are  limited 
by  the  type  class  system’s  restriction  to  type  inference — the 
programmer  is  given  no  tool  to  resolve  the  ambiguity.  Just 
as  type  inference  can  be  augmented  by  type  annotations 
to  help  the  type  system  where  it  can’t  help  itself,  as  with 
polymorphic  recursion,  we  should  be  able  to  provide  some 
sort  of  annotation  to  help  Haskell  resolve  ambiguous  uses  of 
type  classes. 
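
The dummy-argument workaround mentioned above can be sketched as follows. The class Symbolic, its two instances, and testAt are hypothetical stand-ins for the symbolic types and for test; Proxy and asProxyTypeOf come from Data.Proxy in the base library.

```haskell
import Data.Proxy (Proxy (..), asProxyTypeOf)

-- A stand-in for "types that can hold symbolic variables".
class Symbolic a where
  var :: String -> a
  describe :: a -> String

newtype Plain = Plain String
newtype Tagged = Tagged String

instance Symbolic Plain where
  var = Plain
  describe (Plain s) = s

instance Symbolic Tagged where
  var = Tagged
  describe (Tagged s) = "<" ++ s ++ ">"

-- The type a occurs only in an intermediate value, so the expression
-- describe (var "a") on its own is ambiguous. The dummy Proxy argument
-- carries no data; it exists solely to pick the instance.
testAt :: Symbolic a => Proxy a -> String
testAt p = describe (var "a" `asProxyTypeOf` p)
```

Callers then choose the symbolic type at the call site, for example testAt (Proxy :: Proxy Plain) or testAt (Proxy :: Proxy Tagged), which is workable but is precisely the kind of unpleasantry described in the text.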

9  Future  work 

9.1  Verification 

One of the unsatisfying aspects of the verification example
is that it was necessary to make the internal state
of the counter an explicit parameter. Doing this in a
complex model would entail passing around a lot of extra
parameters, just the sort of thing we’d like to avoid.
Forcing the model to be explicit about its internal state
also undercuts one of the strengths of the signal transformer
model that sets it apart from state transformer models, making
it a sort of unwelcome hybrid.

However,  using  ideas  from  Symbolic  Trajectory  Evalua¬ 
tion  [14],  we  are  currently  working  with  symbolic  domains 
that  have  a  partial  order  structure.  Symbolic  simulation 
proceeds  by  starting  with  initial  states  set  to  bottom,  with 
iteration  of  the  model  gradually  adding  more  information. 

We  are  also  currently  applying  symbolic  simulation  to 
simple  pipelined  microarchitectures  to  verify  correctness 
of  hazard  avoidance,  using  a  self-consistency  checking  ap¬ 
proach  [16].  The  technique  is  to  simulate  a  stream  of  sym¬ 
bolic  instructions  two  times.  Let  us  assume  that  the  pipeline 
has  two  stages.  In  the  first  case,  we  feed  two  symbolic  in¬ 
structions  followed  by  a  no-op.  In  the  second  case,  we  feed 
the  same  two  symbolic  instructions  separated  by  the  no-op. 
The test is that the contents of the registers are the same after
the third instruction, demonstrating that the hazard logic is
working correctly.

9.2  Elaboration  monads 

One  of  the  shortcomings  of  Hawk  is  that  it  has  no  explicit  no¬ 
tion  of  elaboration  separate  from  the  semantics  of  the  model. 
Elaboration  is  the  process  of  translating  a  possibly  higher- 
order  Hawk  circuit  into  a  first-order  description,  such  as  the 
hardware  languages  VHDL  or  Verilog.  This  was  not  always 
the  case.  Initially,  Hawk  was  similar  to  Lava  [2],  using  a 
monad  to  capture  circuit  elaboration.  The  monad  might 
be  used  to  generate  net-lists  for  the  purposes  of  fabrication, 
or  it  might  produce  logical  formulae  for  input  to  a  theorem 
prover.  For  simulation,  the  monad  is  essentially  the  identity 
monad,  since  all  we  have  to  do  is  glue  together  functions. 
However,  during  simulation,  the  monad  could  also  provide 
the  service  of,  for  example,  splitting  probes  that  get  dupli¬ 
cated. 

One  reason  that  we  departed  from  an  explicit  monadic 
style  is  that  the  mutually  recursive  streams  idiom  that  works 
so  well  is  not  supported  by  the  do  notation.  What  we  pro¬ 
pose  is  to  extend  the  do  notation  so  that  bindings  are  recur¬ 
sive. 
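
To give a feel for what recursive bindings in do notation would look like, the sketch below uses GHC's RecursiveDo extension (the mdo keyword), which requires a MonadFix instance for the monad. We use the identity monad, matching the simulation reading of elaboration described above; counterM and its feedback structure are assumptions made for this example.

```haskell
{-# LANGUAGE RecursiveDo #-}

import Data.Functor.Identity (Identity, runIdentity)

-- A resettable counter written monadically. The binding for prev refers to
-- out, which is bound on a later line; mdo permits this recursion.
counterM :: [Bool] -> Identity [Int]
counterM reset = mdo
  prev <- pure (0 : out)                  -- the delay, initialised to 0
  out <- pure (zipWith step reset prev)   -- reset to 0, or increment
  pure out
  where
    step r p = if r then 0 else p + 1

runCounter :: [Bool] -> [Int]
runCounter = runIdentity . counterM
```

For instance, runCounter [False, False, True, False] produces [1, 2, 0, 1]. A different monad in place of Identity could record each binding as it is made, which is the hook an elaboration monad needs.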
Abstract. We describe a set of remarkably simple algebraic laws governing
microarchitectural components. We apply these laws to incrementally
transform a pipeline containing forwarding, branch speculation and
hazard  detection  so  that  all  pipeline  stages  and  forwarding  logic  are  re¬ 
moved.  The  resulting  unpipelined  machine  is  much  closer  to  the  reference 
architecture,  and  presumably  easier  to  verify. 


1  Introduction 

Transformational  laws  are  well  known  in  digital  hardware,  and  form  the  basis  of 
logic simplification and minimization, and of many retiming algorithms. Traditionally,
these laws occur at the gate level: de Morgan’s law being a classic example.
In  this  paper,  we  examine  whether  corresponding  transformational  laws  hold  at 
the  microarchitectural  level. 

A  priori,  there  is  no  reason  to  think  that  large  microarchitectural  components 
should  satisfy  any  interesting  algebraic  laws,  as  they  are  constructed  from  thou¬ 
sands  of  individual  gates.  Boundary  cases  could  easily  remove  any  uniformity 
that  has  to  exist  for  simple  laws  to  be  present.  Yet  we  have  found  that  when 
microarchitectural  units  are  presented  in  a  particular  way,  many  powerful  laws 
appear.  Moreover,  as  we  demonstrate  in  this  paper,  these  laws  by  themselves  are 
powerful  enough  to  allow  us  to  show  equivalence  of  pipelined  and  non-pipelined 
microarchitectures. 

We  have  used  this  algebraic  approach  to  simplify  a  pipelined  microarchi¬ 
tecture  that  uses  forwarding,  branch  speculation  and  pipeline  stalling  for  haz¬ 
ards.  The  resulting  pipeline  is  very  similar  to  the  reference  machine  specification 
(i.e.  no  forwarding  logic),  while  still  retaining  cycle-accurate  behavior  with  the 
original  implementation  pipeline.  The  top-level  transformation  proof  is  simple 
enough  to  be  carried  out  on  paper,  but  we  have  mechanized  enough  of  the  theory 
in  the  Isabelle  theorem  prover  [20]  to  have  verified  it  semi-automatically,  using 
Isabelle’s  powerful  rewriting  engine. 

Interestingly,  both  circuits  and  laws  can  be  expressed  diagrammatically.  A 
paper  proof  (transformation  using  equivalence  laws)  proceeds  as  a  series  of  mi¬ 
croarchitecture  block  diagrams,  each  an  incrementally  transformed  version  of  the 
last. The laws often have a geometric flavor to them, such as laws to swap two
components with each other, or laws to absorb one component into another. We
find this diagrammatic approach an excellent way to communicate proofs.

For  us,  the  most  time-consuming  part  of  this  technique  has  been  discovering 
the  local  behavior-preserving  laws.  It  is  our  experience  that  these  laws  are  much 
easier  to  discover  when  one  uses  the  right  level  of  abstraction.  In  particular, 
we  encapsulate  all  control  and  dataflow  information  concerning  a  given  instruc¬ 
tion  in  the  pipeline  into  an  abstract  data  type  called  a  transaction  [1,17].  We 
have  found  that  not  only  do  transactions  reduce  the  size  of  microarchitecture 
specifications,  they  also  provide  enough  “auxiliary”  state  information  to  make 
law-discovery  practical. 

The  rest  of  the  paper  gives  a  brief  introduction  to  our  specification  language, 
and  then  discusses  many  of  the  laws  we  have  discovered.  We  then  show  their  use 
by  applying  the  laws  in  a  proof  of  equivalence  between  two  microarchitectures. 
While  space  constraints  prohibit  us  from  giving  the  complete  proof,  the  top-level 
proof  is  sketched  diagrammatically  in  [16]. 

2  Specifying  a  Pipelined  Microarchitecture 

We  specify  microarchitectures  using  the  Hawk  language  [4, 17].  Hawk  allows  us 
to  express  modern  microarchitectures  clearly  and  concisely,  to  simulate  the  mi¬ 
croarchitectures,  either  directly  with  concrete  values,  or  symbolically,  and  pro¬ 
vides  a  formal  basis  for  reasoning  about  their  behavior  at  source-code  level. 
Currently  Hawk  is  a  set  of  libraries  built  on  top  of  the  pure  functional  language 
Haskell,  which  is  strongly  typed,  supports  first-class  functions,  and  infinite  data 
structures,  such  as  streams  [8,21].  It  is  this  legacy  that  led  us  to  look  for  trans¬ 
formation  laws  in  the  first  place:  one  often-cited  benefit  of  purely  functional 
programs  is  that  they  are  amenable  to  verification  through  equational  reason¬ 
ing.  We  wanted  to  see  if  such  algebraic  techniques  scaled  up  to  microarchitectural 
verification. 

2.1  Hawk  Signals 

Hawk is a purely declarative synchronous specification language, sharing a semantic
base similar to Lustre [7]. The basic data structure underlying Hawk is
the signal, which can be thought of as an infinite sequence of values, one per
clock  cycle,  and  circuits  are  pure  functions  from  input  signals  to  output  signals. 
The  elements  of  a  signal  must  belong  to  the  same  type. 

We  use  a  notion  of  transactions  to  specify  the  immediate  state  of  an  en¬ 
tire  instruction  as  it  travels  through  the  microprocessor  [1].  A  transaction  is  a 
record  with  fields  containing  the  instruction’s  opcode,  source  register  names  and 
values,  and  the  destination  register  name  and  its  value,  plus  any  additional  in¬ 
formation,  like  the  speculative  branch  target  PC  for  each  branching  instruction. 
A  microarchitecture  is  a  network  of  components,  each  of  which  processes  signals 
of  transactions. 


Figure  1  shows  the  diagram  of  a  simple  one-stage  microarchitecture,  built  out 
of  transaction  signal  processors.  Each  component  incrementally  assigns  values  to 
various  transaction  fields,  based  on  the  component’s  internal  state  (if  any)  and 
the values of transaction fields assigned by earlier components. A textual Hawk
specification of this circuit consists of a set of mutually-recursive stream equations
between  the  components.  However,  in  this  paper  we  will  represent  Hawk  circuits 
as  diagrams. 

Fig. 1. One-stage pipeline.

For example, the regFile component has two transaction signal inputs and one
transaction signal output. At a given clock cycle, the first input (called
regFileIn in Figure 1) contains a transaction whose opcode and register name
fields have been initialized, but whose value fields have all been zeroed
out. The second input (called writeback) contains the completed transaction
from the previous clock cycle. The regFile component first updates its internal
register file state, based on the destination register name and value fields of the
writeback input. It then fills in the source operand value fields of the regFileIn
transaction based on the corresponding operand register names and the updated
register file, and outputs the filled-in transaction, all within the same clock cycle.

The alu component examines the opcode and source operand value fields of
the transaction output by regFile. If the opcode is an ALU operation (which
includes branch instructions), the alu component computes the appropriate result,
assigns the result to the destination operand value field of the transaction,
and outputs the transaction along the memIn wire, again within the same (long)
clock cycle. If the opcode is not an ALU operation, the alu component outputs
the transaction unchanged.

The  mem  component  behaves  similarly  for  memory  load  and  store  operations. 
Like  the  regFile  component,  the  mem  component  has  internal  state,  representing 
the  contents  of  data  memory  at  each  clock  cycle.  This  state  is  updated  and 
referenced  based  on  the  transactions  sent  to  the  mem  component.  Just  as  with 
the  alu  component,  all  memory  and  transaction  updating  occurs  within  the 
same  clock  cycle.  The  mem  component  sends  the  completed  transaction  to  a  delay 
component  (represented  in  our  diagrams  as  a  shaded  box),  to  make  it  available  to 
the ICache and regFile components in the next clock cycle. These transactions
also  become  the  output  of  the  entire  microarchitecture,  as  is  shown  by  the  right¬ 
most  arrow.  The  initial  value  output  by  the  delay  component  is  the  default 
transaction  nopTrans,  which  represents  an  “inert”  transaction  which  behaves 
like  a  NOP  instruction,  but  does  not  affect  the  ICache’s  program  counter. 

The  ICache  component  produces  new  transactions,  based  on  the  value  of  the 
current  program  counter  and  the  contents  of  program  memory  (the  instruction- 
set  architectures  we  consider  have  separate  address  spaces  for  instructions  and 
data).  Both  the  current  PC  and  the  instruction  memory  contents  are  internal 
to ICache. The ICache takes on its writeback input the completed transaction
from  the  previous  clock  cycle.  The  ICache  examines  the  transaction  for  branches 
that  have  been  taken.  When  it  finds  such  an  instruction,  it  modifies  its  internal 
PC  accordingly  and  starts  fetching  transactions  from  the  branch  target  address. 
The  ICache  has  as  output  a  signal  of  transactions  representing  the  newly-fetched 
instructions.  Each  transaction’s  source  and  destination  operand  values  are  ini¬ 
tialized  to  zero,  since  the  ICache  doesn’t  know  what  values  they  should  have. 
The  other  pipeline  components  will  fill  in  these  fields  with  their  correct  values. 
The  ICache  has  a  second  input,  called  stall,  which  is  a  signal  of  Boolean  values. 
On  clock  cycles  where  stall  is  asserted,  the  ICache  will  output  the  same  trans¬ 
action  as  it  did  on  the  previous  clock  cycle.  In  this  simple  microarchitecture, 
stall  is  always  false.  In  more  complex  pipelines,  the  stall  signal  is  typically 
asserted  when  the  pipeline  needs  to  stall  due  to  a  branch  misprediction. 

For  more  complex  pipelines,  we  also  allow  the  ICache  to  perform  branch 
prediction,  based  on  an  internal  branch  target  buffer.  When  performing  branch 
prediction,  the  ICache  will  also  annotate  branch  instruction  transactions  with  the 
predicted branch target PC. A branchMisp component (not shown in Figure 1)
can  locally  compare  the  predicted  branch  target  with  the  actual  branch  target 
to  determine  if  a  branch  misprediction  has  occurred. 


3  Microarchitecture  Laws 


Fig. 2. Universal circuit-duplication law.

With any algebraic reasoning there need to be some ground rules. We take
as fundamental the notion of referential transparency or, in hardware terms,
a circuit duplication law. Any circuit whose output is used in multiple places
is equivalent to duplicating the circuit itself, and using each output once.
This law is shown graphically in Figure 2.
Because  of  the  declarative  nature  of  our  specification  language,  every  circuit 
satisfies  this  law.  That  is,  it  is  impossible  within  Hawk  for  a  specification  of  a 
component  to  cause  hidden  side-effects  observable  to  any  other  component  spec¬ 
ification.  In  many  specification  languages  this  law  does  not  hold  universally.  For 
example,  duplicating  a  circuit  that  incremented  a  global  variable  on  every  clock 
cycle  would  cause  the  global  variable  to  be  incremented  multiple  times  per  clock 
period,  breaking  behavioral  equivalence.  Hawk  circuits  can  still  be  stateful,  but 
all  stateful  behavior  must  be  local  and/or  expressed  using  feedback. 
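
In Haskell terms the circuit duplication law is simply the equivalence of a shared let binding with its inlined form. The sketch below states both sides for arbitrary components f, g and h, assuming signals are lazy lists; the names shared and duplicated are ours.

```haskell
type Signal a = [a]

-- One copy of circuit f, with its output wire fanned out to g and h.
shared :: (Signal a -> Signal b) -> (Signal b -> c) -> (Signal b -> d)
       -> Signal a -> (c, d)
shared f g h xs = let ys = f xs in (g ys, h ys)

-- Two copies of circuit f, each feeding exactly one consumer.
duplicated :: (Signal a -> Signal b) -> (Signal b -> c) -> (Signal b -> d)
           -> Signal a -> (c, d)
duplicated f g h xs = (g (f xs), h (f xs))
```

The two functions agree for every choice of f, g and h, including a stateful f such as scanl (+) 0 (a running sum), because f's state is local to each copy.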

The  next  few  sections  introduce  many  other  laws,  some  of  which  are  specific  to 
particular  combinations  of  components,  while  others  are  quite  widely  applicable. 
Each  instantiation  of  a  law  needs  to  be  proved  with  respect  to  the  specification 
of  the  circuit  components  involved.  We  have  found  induction  and  bisimulation 
to  be  the  most  useful  ways  of  proving  the  laws  in  this  paper,  expressed  as  proofs 
in  Isabelle. 


3.1  Delay  Laws 


Fig. 3. Feedback rotation law.

The delay circuit is a fundamental building block of clocked circuits,
especially when combined with feedback. A feedback variant of the
circuit duplication law shown in Figure 3, called the feedback rotation
law, allows circuits to be split along feedback
wires.  This  law  is  not  universal,  but  it  is  valid  for  any  circuit  that  does  not 
contain  zero-delay  cycles  (amongst  others).  Happily,  all  of  the  laws  we  discuss, 
including  the  feedback  rotation  law  itself,  preserve  a  well-formedness  property: 
if  a  circuit  contains  no  zero-delay  cycles,  then  any  transformed  circuit  will  also 
have  no  zero-delay  cycles. 

Fig. 4. Time-invariance law.

The time-invariance law (Figure 4) is also nearly universal. A circuit is
time-invariant if one can retime the circuit by removing the delays from all
the inputs of the circuit and placing new delays on the circuit’s outputs. Any
combinatorial circuit that preserves default values is automatically
time-invariant, but so are stateful circuits like the
register file and memory cache. Interestingly, the ICache is not.

We  use  the  above  laws  extensively  to  remove  pipeline  stages.  If  a  pipeline 
stage  is  time-invariant,  then  we  can  move  the  pipeline  registers  (represented 
as  delay  circuits)  from  before  the  pipeline  stage  to  afterwards.  If  subsequent 
pipeline stages are also time-invariant, then we can repeat the process, eventually
moving  all  of  the  delay  circuits  to  the  end  of  the  pipeline.  However,  forwarding 
logic  between  pipeline  stages  must  still  access  the  appropriate  time-delayed  out¬ 
puts  of  later  pipeline  stages.  The  feedback-rotation  law  polices  this,  and  ensures 
that  the  appropriate  time-delay  is  kept  by  forcing  delays  to  be  inserted  on  all 
feedback  wires  to  the  forwarding  circuits. 
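
For a combinational circuit the time-invariance law can be written down directly. The sketch below assumes signals are lazy lists, restricts attention to circuits whose input and output types coincide, and checks the law on a finite prefix; all names are ours.

```haskell
type Signal a = [a]

delay :: a -> Signal a -> Signal a
delay x xs = x : xs

-- Left-hand side of the law: the delay sits on the circuit's input.
delayBefore :: a -> (a -> a) -> Signal a -> Signal a
delayBefore d g xs = map g (delay d xs)

-- Right-hand side: the delay has been moved to the circuit's output.
delayAfter :: a -> (a -> a) -> Signal a -> Signal a
delayAfter d g xs = delay d (map g xs)

-- Compare the first k clock cycles of both sides.
agreeOn :: Eq a => Int -> a -> (a -> a) -> Signal a -> Bool
agreeOn k d g xs = take k (delayBefore d g xs) == take k (delayAfter d g xs)
```

With default value 0, doubling preserves the default (0 * 2 is 0) and the two sides agree; incrementing does not (0 + 1 is 1) and the two sides differ on the very first cycle, which illustrates why the law requires default values to be preserved.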


3.2  Bypasses  and  Bypass  Laws 

The  purpose  of  forwarding  logic  in  a  pipeline  is  to  ensure  that  results  computed 
in  later  pipeline  stages  are  available  to  earlier  pipeline  stages  in  time  to  be 
used.  Conceptually,  the  forwarding  logic  at  each  pipeline  stage  examines  its 
current  instruction’s  source  operand  register  names  to  see  if  they  match  a  later 
stage’s  destination  operand  register  name.  For  every  matching  source  operand, 
the  operand  value  is  replaced  with  the  result  value  computed  by  the  later  pipeline 
stage.  Non-matching  source  operands  continue  to  use  operand  values  given  by 
the  preceding  pipeline  stage. 


Fig. 5. Bypass circuit (inputs inp and update, output out).

This conceptual logic can be implemented concisely using transactions. A
bypass circuit (Figure 5) has two inputs, each a signal of transactions. The
first input (inp) contains the transactions from the preceding pipeline stage.
The second input (update) contains the transactions from a subsequent pipeline
stage. The bypass circuit at each clock cycle compares
the source operand names of the current inp transaction with the destination
operand names of the current update transaction. The output of bypass
is identical to inp, except that source operands matching update’s destination
operand are updated. Bypasses arise frequently enough in pipeline specifications
that we draw them specially, as diamonds with the update input connected to
either the top or the bottom.
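
A bypass is short enough to sketch in full. The transaction record below is a pared-down, hypothetical version (its field names are assumptions, and Hawk's actual transaction type carries more information); signals are again taken to be lazy lists.

```haskell
type Reg = Int

-- Hypothetical, simplified transaction.
data Trans = Trans
  { opcode :: String
  , dest   :: Maybe (Reg, Int)   -- destination register name and value
  , srcs   :: [(Reg, Int)]       -- source register names and values
  } deriving (Eq, Show)

-- One clock cycle of bypassing: every source operand of inp whose register
-- name matches upd's destination takes upd's result value.
bypass1 :: Trans -> Trans -> Trans
bypass1 inp upd =
  case dest upd of
    Nothing -> inp
    Just (r, v) ->
      inp { srcs = [ (s, if s == r then v else old) | (s, old) <- srcs inp ] }

-- The bypass circuit applies bypass1 pointwise to its two input signals.
bypass :: [Trans] -> [Trans] -> [Trans]
bypass = zipWith bypass1
```

For example, if inp reads registers 1 and 2 and update has just computed 42 for register 1, the output transaction carries 42 as the value of its first source operand and leaves the second untouched.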


Fig.  6.  bypass  circuit  idempotence  law 


Bypass circuits have many nice properties. Not only are they time-invariant,
but they also obey a kind of idempotence (Figure 6) and interact closely with
register files and various execution units.


The fundamental interaction between a bypass and register file is shown in
Figure 7. We call this the register-bypass law, and it is used repeatedly in
eliminating forwarding logic when simplifying pipelines. The law states that we
can delay writing a value into the register file, so long as we also forward the
value to be written, in case that register was being read on the same clock cycle.

Fig. 7. register-bypass law

Initially  we  considered  this  law  to  be  a  theorem  about  register  files,  and 
accordingly  we  proved  that  it  held  for  a  number  of  different  implementations. 
However,  it  is  also  tempting  to  view  this  law  as  an  axiom  of  register  files.  In 
effect,  by  using  the  law  repeatedly  from  right  to  left,  we  obtain  a  specification 
for  how  the  register  file  must  behave  for  any  time  prefix. 


Hazard-Bypass Law  Another bypass law permits the removal of bypasses
between  execution  units.  It  is  often  the  case  that  after  retiming  all  delay  circuits 
to  the  end  of  a  pipeline,  two  execution  units  in  a  pipeline  (such  as  an  ALU 
unit  and  a  Load/Store  unit)  are  connected  with  one-cycle  feedback  loops.  Each 
bypass  circuit  is  forwarding  the  outputs  of  an  execution  unit  to  the  inputs  of 
that  same  execution  unit,  one  clock  cycle  later. 

If the upstream pipeline stages can guarantee that there is no hazard between
successive transactions, then the double feedback is equivalent to the single
feedback circuit shown at the bottom of Figure 8. This (conditional) identity is
called the hazard-bypass law.

To be more concrete, suppose exec1 is the ALU and exec2 the memory cache.
Then an ALU-mem hazard arises if a transaction which loads a register value from
memory is immediately followed by an ALU operation which requires that
register’s value. Under these circumstances the two feedback loops would give
different results. Under all other circumstances the two circuits are equivalent.
We express this conditional equivalence using the no_haz component. It is an
example of a projection component and is discussed in the next section.

Fig. 8. hazard-bypass law

3.3  Projection  Laws 

Many  laws,  like  the  hazard-bypass  law  above,  require  that  the  input  signals 
satisfy  certain  properties,  and  commonly,  we  may  know  that  the  output  signal 
of  a  given  component  always  satisfies  a  particular  property.  We  can  capture  this 
knowledge  of  properties  using  signal  projections . 

A  signal  projection  is  a  component  with  one  input  and  one  output.  As  long 
as  the  input  signal  satisfies  the  property  of  interest,  the  component  acts  like  an 
identity  function,  returning  the  input  signal  unchanged.  However,  if  the  input 
does  not  satisfy  the  property  we  are  interested  in,  the  projection  component 
modifies  the  input  signal  in  some  arbitrary  way  so  that  the  property  is  satisfied. 

Let us consider an example. For the hazard-bypass law we are interested in
expressing the absence of ALU-mem hazards in a transaction signal. We reify
this property as a no_haz projection. On each clock cycle, the no_haz component
compares the current input transaction with the previous input transaction. If
there is no ALU-mem hazard between the two transactions, then the current
transaction is output unchanged. If a hazard does exist, then no_haz will instead
output nopTrans, which is guaranteed not to generate a hazard (since nopTrans
contains no source operands).

Where  do  projections  come  from?  After  all,  they  are  not  the  sort  of  compo¬ 
nent  that  microarchitectural  designers  introduce  just  for  fun. 

Figure 9 provides an example of a law which “generates” a projection. The
hazard-squashing logic guarantees that its output contains no hazards, and this
is expressed by the fact that the circuit is unchanged when the no_haz component
is inserted on its output.

(The hazard component outputs a Boolean on each clock cycle stating whether
its two input transactions constitute a hazard. The kill component takes a
transaction signal and a Boolean signal as inputs. On each clock cycle, if the
Boolean input is false, then kill outputs its input transaction unchanged. If the
Boolean input is true, then kill outputs a nopTrans, effectively “killing” the
input transaction.)
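
The hazard, kill, and no_haz components described above can be sketched over list-based signals in Haskell. Everything here is an assumption made for illustration: the Trans record, the choice of "add" and "sub" as the ALU opcodes, the initial nopTrans used by delay, and in particular the wiring of squash, which is only one plausible reading of the hazard-squashing logic and is not taken from Figure 9.

```haskell
type Reg = Int

-- Simplified transaction carrying only the fields the hazard check reads.
data Trans = Trans
  { opcode :: String
  , dest   :: Maybe Reg
  , srcs   :: [Reg]
  } deriving (Eq, Show)

-- A transaction with no source operands, so it never depends on a load.
nopTrans :: Trans
nopTrans = Trans "nop" Nothing []

-- True when `prev` loads a register that the ALU transaction `cur` reads.
hazard :: Trans -> Trans -> Bool
hazard prev cur =
  opcode prev == "load"
    && opcode cur `elem` ["add", "sub"]
    && maybe False (`elem` srcs cur) (dest prev)

-- Pass the transaction through, or replace it with nopTrans when k is True.
kill :: Trans -> Bool -> Trans
kill t k = if k then nopTrans else t

-- One-cycle delay on a list-based signal, starting from nopTrans.
delay :: [Trans] -> [Trans]
delay ts = nopTrans : ts

-- The no_haz projection: compare each input with the previous input.
noHaz :: [Trans] -> [Trans]
noHaz ts = zipWith check (delay ts) ts
  where
    check prev cur = if hazard prev cur then nopTrans else cur

-- Assumed wiring of the hazard-squashing logic: kill driven by hazard.
squash :: [Trans] -> [Trans]
squash ts = zipWith kill ts (zipWith hazard (delay ts) ts)
```

Under these assumptions, noHaz (squash ts) == squash ts for every input list ts, which is the list-level reading of the statement that inserting no_haz on the output leaves the circuit unchanged.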


To be useful, a projection component needs to be able to migrate from a source
circuit that produces it (such as the circuit in Figure 9) to a target circuit that
needs the projection to enable an algebraic law (such as the hazard-bypass law).
Thus a projection component must be able to commute with the intervening
circuits between the source and the target circuit. Well-designed projections
commute with many circuits. For instance, the no_haz projection commutes with
bypass, alu, mem, and regFile components. It also commutes with delay
components (that is, no_haz is time-invariant).

Projections  are  also  convenient  for  expressing  the  fact  that  a  component 
only  uses  some  of  the  fields  of  an  input  transaction.  For  instance,  the  hazard 
component  only  looks  at  the  opcode,  source,  and  destination  register  name  fields 
of its two input transactions. We can create a projection called proj_ctrl that
sets every other field of a transaction to a default value, and prove a law stating
that the hazard component is unchanged when proj_ctrl is added to any of
its  inputs.  We  can  then  show  that  proj_ctrl  commutes  with  other  components, 
such  as  bypasses  and  delays.  This  allows  us  to  move  the  input  wires  to  hazard 
across  these  other  components,  which  is  sometimes  necessary  to  enable  other 
laws.  Similarly,  the  proj_branch_info  projection  allows  us  to  move  ICache  and 
branch_misp  component  inputs. 
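
The idea behind proj_ctrl can be sketched in Haskell as follows. The record layout, the default value 0, and the particular hazard predicate are assumptions for this illustration only; the point is that a component reading only opcodes and register names cannot observe the projection.

```haskell
type Reg = Int

-- Operands pair a register name with a data value.
data Trans = Trans
  { opcode :: String
  , dest   :: Maybe (Reg, Int)
  , srcs   :: [(Reg, Int)]
  } deriving (Eq, Show)

-- Keep the opcode and register names; reset every operand value to 0.
projCtrl :: Trans -> Trans
projCtrl t = t { dest = fmap reset (dest t), srcs = map reset (srcs t) }
  where
    reset (r, _) = (r, 0)

-- A hazard check that reads only opcodes and register names.
hazard :: Trans -> Trans -> Bool
hazard prev cur =
  opcode prev == "load"
    && opcode cur `elem` ["add", "sub"]
    && maybe False (\(r, _) -> r `elem` map fst (srcs cur)) (dest prev)
```

In this model, hazard (projCtrl a) (projCtrl b) == hazard a b for all a and b, and projCtrl (projCtrl t) == projCtrl t, so projCtrl behaves as a projection that hazard absorbs on either input.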


Fig. 9. Hazard-squashing logic guarantees no hazards


4  Transforming  the  Microarchitecture 

The  laws  we  have  been  discussing  can  be  used  for  aggressively  restructuring 
microarchitectures  while  retaining  equivalence.  We  have  used  them  to  simplify 
several  pipelined  microarchitectures  with  a  view  to  verification.  The  example 
we  present  here  contains  three  levels  of  forwarding  logic,  resolves  hazards  by 
stalling  the  pipeline,  and  performs  branch  speculation.  The  block  diagram  for 
this  microarchitecture  is  shown  in  Figure  10. 

By  using  just  algebraic  laws,  we  have  been  able  to  reduce  most  of  the  com¬ 
plexity,  leaving  essentially  an  unpipelined  microarchitecture.  We  are  currently 
implementing  the  algebraic  laws  as  a  rewrite  system  in  Isabelle.  For  this  paper 
we  describe  our  top-level  rewrite  strategy  informally. 


Retiming  We  first  remove  all  delay  circuits  from  the  main  pipeline  path.  We 
accomplish  this  by  repeatedly  applying  the  time-invariance  law,  and  by  splitting 
delays  along  wires  through  the  circuit  duplication  and  feedback  rotation  laws. 
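
The retiming step relies on moving delays across components. A minimal Haskell sketch of one common form of time-invariance, for a stateless component lifted over a signal, is given here. Signals are modelled as infinite lists, and the names delay, lift, delayThenApply, and applyThenDelay are assumptions for this sketch; the paper's own statement of the law may differ in detail.

```haskell
-- Signals as infinite lists, one element per clock cycle.
type Signal a = [a]

-- A unit delay: output the initial value first, then the input one cycle late.
delay :: a -> Signal a -> Signal a
delay x s = x : s

-- Lift a combinational (stateless) function to work cycle by cycle.
lift :: (a -> b) -> Signal a -> Signal b
lift = map

-- The two sides of the law for a circuit f and an initial value x.
delayThenApply, applyThenDelay :: (a -> b) -> a -> Signal a -> Signal b
delayThenApply f x s = lift f (delay x s)
applyThenDelay f x s = delay (f x) (lift f s)
```

Both functions produce the same signal for any f, x, and s, so a delay in front of a stateless component can be moved behind it provided the initial value is mapped through f.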


Fig. 10. Microarchitecture before simplification


Move  control  wires  Next,  we  move  all  wires  not  directly  involved  with  for¬ 
warding  logic  to  either  before  or  after  all  of  the  bypass  circuits.  This  is  to  enable 
the  hazard-bypass  laws,  which  we  apply  in  a  later  step.  We  move  the  wires  by  in¬ 
serting  projection  circuits  and  using  the  corresponding  projection-commutativity 
laws. 


Propagate  hazard  information  The  hazard-bypass  laws  can  only  be  ap¬ 
plied  when  there  are  no  hazards  between  the  affected  stages.  So  we  generate  a 
no-hazard  projection  at  the  end  of  the  dispatch  stage  (which  is  justified  by  a 
projection-absorption  law  applicable  to  the  kill-circuit  complex  in  that  stage), 
and then move it between the first and second bypass circuits. We also use
additional properties of the proj_ctrl, kill, and regFile circuits (discussed in [16])
to  swap  the  hazard/kill  complex  with  the  register  file,  so  that  the  register-bypass 
law  can  be  used  more  readily  in  the  next  step  of  the  simplification.  The  circuit 
in  Figure  11  shows  the  microarchitecture  after  this  step  has  been  completed. 
Notice  that  the  ALU  and  memory  units  are  now  connected  exactly  as  required 
for  an  application  of  the  hazard-bypass  law. 


Fig.  11.  Microarchitecture  after  the  “propagate  hazard  information”  step 


Remove  forwarding  logic  We  can  now  apply  the  hazard-bypass  law  to  remove 
the  bypass  circuit  just  prior  to  the  memory  unit.  We  eliminate  the  other  two 
bypass  circuits  by  applying  the  register-bypass  law  twice. 

Cleanup  The  pipeline  has  now  been  simplified  as  much  as  possible,  except  that 
there  are  still  some  extra  delay  components  as  well  as  several  unnecessary  pro¬ 
jection  circuits.  We  merge  delay  components,  then  move  the  projection  circuits 
back  to  their  places  of  origin  and  remove  them  using  the  projection  laws  in  the 
opposite  direction. 

Fig. 12. Microarchitecture after simplification


The  final  microarchitecture  is  shown  in  Figure  12.  This  circuit  still  outputs 
exactly  the  same  transaction  values,  cycle-for-cycle,  as  the  microarchitecture  in 
Figure  10,  but  is  considerably  less  complex.  We  can  now  apply  conventional 
techniques  to  verify  that  this  microarchitecture  is  a  valid  implementation  of  the 
ISA. 

5  Discussion 

5.1  Related  work 

Hawk  is  built  on  top  of  the  pure  functional  language  Haskell,  where  algebraic 
techniques  for  transforming  functional  programs  are  routinely  used  for  equiva¬ 
lence  checking  and  verification  [2, 3, 13]  and  for  compilation  and  optimization  [5, 
12].  Much  of  our  work  can  be  seen  as  an  extension  of  these  ideas.  Hawk  itself  is 
very  similar  in  flavor  to  Lustre  [6]  except  that  in  Lustre  signals  are  accompanied 
by  additional  clock  information.  The  Hawk  specification  style  follows  from  the 
work of Johnson [9], O’Donnell [18], and Sheeran [25].

We  have  also  been  influenced  by  the  algebraic  techniques  used  in  the  re¬ 
lational  hardware-description  language  Ruby  [24].  Sizeable  Ruby  circuits  have 
been  successfully  derived  and  verified  through  algebraic  manipulation  [10,11]. 
What  distinguishes  our  work  is  our  focus  on  microarchitectural  units  as  objects 
of  study  in  their  own  right.  The  Ruby  research  has  emphasized  circuits  at  the 
gate  level. 

In terms of verification, our approach is most similar to two known techniques,
called retiming [14, 23, 26] and unpipelining [15]. A circuit is retimed when the
delay components of the circuit are repositioned, while the functional components
are left unchanged, effectively through repeated applications of the
time-invariance law. Typically, circuits are retimed to reduce the clock cycle
time. In contrast, we retime circuits as part of a simplification process. In fact,
we often use the time-invariance law to increase cycle time!

Unpipelining  [15]  is  a  verification  technique  where  a  pipelined  microarchitec¬ 
ture,  specified  as  a  state  machine,  is  incrementally  transformed  into  a  functionally- 
equivalent  unpipelined  microarchitecture.  Unpipelining  proceeds  by  repeatedly 
merging  the  last  stage  of  a  pipeline  into  the  next  to  last  stage,  producing  a  mi¬ 
croarchitecture  with  one  less  stage  on  each  iteration.  On  each  iteration,  the  two 
microarchitectures  are  proven  equivalent  by  induction  over  time.  This  is  simi¬ 
lar  to  our  approach,  except  that  we  use  transactions  to  encapsulate  and  reuse 
many  of  the  verification  steps,  and  we  only  need  to  prove  the  equivalence  of 
the  portion  of  the  microarchitecture  being  transformed,  rather  than  the  entire 
microarchitecture,  on  each  iteration.  On  the  other  hand,  Levitt  and  Olukotun’s 
implementation  of  unpipelining  is  much  more  automated  than  our  work  up  to 
now. 

Transactions  were  a  key  concept  in  allowing  us  to  discover  and  formulate 
many  of  the  algebraic  laws  of  microarchitectural  components.  Unsurprisingly, 
the  usefulness  of  transactions  has  been  noticed  before.  Aagaard  and  Leeser 
used  transactions  to  specify  and  verify  hierarchical  networks  of  pipelines  [1], 
and  Onder  and  Gupta  have  used  a  similar  concept  of  instruction  contexts  as  a 
core  datatype  in  UPFAST,  an  imperative  microarchitecture  simulation  language 
[19].  Further,  Sawada  and  Hunt  use  an  extended  form  of  transactions  in  their 
verification  of  a  speculative  out-of-order  microarchitecture  [22].  Each  transaction 
records  two  snapshots  of  the  entire  ISA  state,  before  and  after  the  instruction 
is  executed.  In  their  work,  however,  transactions  are  not  part  of  the  microarchi¬ 
tecture  itself,  but  are  constructed  separately  for  verification  purposes. 

5.2  Next  steps  in  microarchitecture  algebra 

As  we  have  come  to  see  it,  the  main  principle  of  applying  algebraic  techniques 
to  microarchitectures  is  to  use  geometric  reasoning  to  move  and  absorb  circuits, 
and  to  express  that  reasoning  as  local  equalities  whenever  possible.  Conditional 
equalities  can  be  expressed  using  projections. 

Some  care  is  required  in  the  definition  of  basic  components.  We  have  striven 
to  design  the  component  circuits  to  satisfy  as  rich  a  variety  of  algebraic  laws  as 
possible,  such  as  preserving  default  values,  satisfying  time-invariance,  and  so  on. 
Sometimes we hit on the correct definitions immediately, but more commonly we
adapted the definitions over time to admit more and more laws. One example of
this  is  in  pipeline  registers.  Initially,  we  used  conditional  delays  to  act  as  pipeline 
registers,  but  since  then  have  found  it  useful  to  separate  clocked  behavior  from 
functional  behavior,  enabling  the  two  dimensions  to  be  manipulated  separately. 

In some sense the components we now manipulate are not optimal in terms of
transistor counts. In particular, many units receive and propagate information
they are not interested in. However, much of this overhead can be removed
automatically through a similar set of rewrite laws built around more primitive
components than those presented in this paper. We plan to write this up in a
subsequent paper.

Abstract. Using the notions of unique fixed point, converging equivalence
relation, and contracting function we generalize the technique of well-founded
recursion. We are able to define functions in the Isabelle theorem prover that
recursively call themselves an infinite number of times. In particular, we can
easily define recursive functions that operate over coinductively-defined types,
such as infinite lists. Previously in Isabelle such functions could only be
defined corecursively, or had to operate over types containing “extra”
bottom-elements. We conclude the paper by showing that the functions for filtering
and flattening infinite lists have simple recursive definitions.


1  Well-founded  recursion 

Rather  than  specify  recursive  functions  by  possibly  inconsistent  axioms,  several  higher 
order logic (HOL) theorem provers [4, 11, 14] provide well-founded recursive function
definition  packages,  where  new  functions  can  be  defined  conservatively.  Recursive 
functions  are  defined  by  giving  a  series  of  pattern  matching  reduction  rules,  and  a 
well-founded  relation. 

For example, the map function applies a function $f$ pointwise to each element of
a finite list. This function can be defined using well-founded recursion:

$$
\begin{aligned}
\mathit{map} &:: (\alpha \to \beta) \to \alpha\ \mathit{list} \to \beta\ \mathit{list} \\
\mathit{map}\ f\ [\,] &= [\,] \\
\mathit{map}\ f\ (x \# xs) &= (f\ x) \mathbin{\#} (\mathit{map}\ f\ xs)
\end{aligned}
$$

The first rule states that map applied to the empty list, denoted by $[\,]$, is equal to
the empty list. The second rule states that map applied to a list constructed out of
the head element $x$ and tail list $xs$, denoted by $x \# xs$, is equal to the list formed by
applying $f$ to $x$ and $\mathit{map}\ f$ to $xs$ recursively.

To define a function using well-founded recursion, the user must also supply a
well-founded relation on one of the function’s arguments¹. A well-founded relation $(<)$
is a relation with the property that there exists no infinite sequence of elements
$x_1, x_2, x_3, x_4, \ldots$ such that

$$\cdots < x_4 < x_3 < x_2 < x_1$$


For  each  reduction  rule,  the  recursive  definition  package  checks  that  every  recursive 
call  on  the  right-hand  side  of  the  rule  is  applied  to  a  smaller  argument  than  on  the 
left-hand  side,  according  to  the  user  supplied  well-founded  relation. 

¹ Some well-founded recursion packages only allow single-argument functions to be defined.
In this case one can gain the effect of multi-argument curried functions by tupling.


In the case of map, we can supply the well-founded relation

$$(xs < ys) = (\mathit{length}\ xs < \mathit{length}\ ys)$$

which  is  true  when  the  number  of  elements  in  the  relation’s  left-hand  list  argument  is 
less  than  the  number  of  elements  in  the  relation’s  right-hand  argument.  The  definition 
of  map  contains  only  one  recursive  rule,  and  it  is  easy  to  prove  that  the  xs  argument 
of  the  recursive  call  of  map  is  smaller  than  the  (x#xs)  argument  on  the  left-hand  side 
of  the  rule,  according  to  this  relation.  In  general,  well-founded  relations  ensure  that 
there  are  no  infinite  chains  of  nested  recursive  calls. 
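
For comparison, the same two reduction rules and the length-based relation can be written in Haskell. This is only an analogy: Haskell does not check termination, so the proof obligation that a well-founded recursion package would discharge is shown here as an ordinary Boolean test on a concrete list. The names mapList, smaller, and obligationHolds are introduced for this sketch.

```haskell
-- Haskell counterpart of the two reduction rules for map on finite lists.
mapList :: (a -> b) -> [a] -> [b]
mapList _ []       = []
mapList f (x : xs) = f x : mapList f xs

-- The measure-based relation from the text: xs is smaller than ys.
smaller :: [a] -> [a] -> Bool
smaller xs ys = length xs < length ys

-- The obligation for the recursive rule, checked on one concrete finite list:
-- the argument of the recursive call is smaller than the original argument.
obligationHolds :: [a] -> Bool
obligationHolds []       = True
obligationHolds (x : xs) = smaller xs (x : xs)
```

The test obligationHolds only makes sense for finite lists; on an infinite list, length does not return, which foreshadows why this relation cannot justify recursion over lazy lists.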

2  Coinductive  types  and  corecursive  functions 

Although well-founded recursion is a useful definition technique, there are many
recursive definitions that fall outside its scope. For instance, there is a non-inductive
type of lazy lists in the Isabelle [11] theorem prover, denoted by $\alpha\ \mathit{llist}$, that is the set
of all finite and infinite lists of type $\alpha$. The function lmap over this type is uniquely
specified by the following recursive equations²:

$$
\begin{aligned}
\mathit{lmap}\ f\ [\,] &= [\,] \\
\mathit{lmap}\ f\ (x \# xs) &= (f\ x) \mathbin{\#} (\mathit{lmap}\ f\ xs)
\end{aligned}
$$

One cannot define lmap using well-founded recursion since the length of an infinite
list does not decrease when you take its tail. In fact, the expression
$\mathit{lmap}\ f\ (x_1 \# x_2 \# x_3 \# \ldots)$ can be unfolded using the above rules to an infinite chain
of recursive calls:

$$
\begin{aligned}
& \mathit{lmap}\ f\ (x_1 \# x_2 \# x_3 \# \ldots) \\
={} & (f\ x_1) \mathbin{\#} (\mathit{lmap}\ f\ (x_2 \# x_3 \# \ldots)) \\
={} & (f\ x_1) \mathbin{\#} (f\ x_2) \mathbin{\#} (\mathit{lmap}\ f\ (x_3 \# \ldots)) \\
={} & (f\ x_1) \mathbin{\#} (f\ x_2) \mathbin{\#} (f\ x_3) \mathbin{\#} (\mathit{lmap}\ f\ (\ldots))
\end{aligned}
$$


Defining functions corecursively

The $\alpha\ \mathit{llist}$ type is an example of a coinductive type. Although there is no general
induction principle for coinductive types, one can use principles of coinduction to
show that two coinductive values are equal, and one can build coinductive values
using corecursion.

In Isabelle’s theory of lazy lists [12], for instance, one builds potentially infinite lists
through the llist_corec operator, which has type
$\beta \to (\beta \to \mathit{unit} + (\alpha \times \beta)) \to \alpha\ \mathit{llist}$.
The llist_corec operator uniquely satisfies the following recursion equation:

$$
\mathit{llist\_corec}\ b\ g =
\begin{cases}
[\,], & \text{if } g\ b = \mathit{Inl}\ () \\
x \mathbin{\#} (\mathit{llist\_corec}\ b'\ g), & \text{if } g\ b = \mathit{Inr}\ (x, b')
\end{cases}
$$

The llist_corec operator takes as arguments an initial value $b$ and a function $g$. When
$g$ is applied to $b$, it either returns $\mathit{Inl}\ ()$, indicating that the result list should be empty,
or the value $\mathit{Inr}\ (x, b')$, where $x$ represents the first element of the result list, and $b'$
represents the new initial value to build the rest of the list from. Function $g$ is called
iteratively in this fashion, constructing a potentially infinite list.

² Isabelle uses a different syntax for lazy lists than for finite lists. In this paper we use the
same syntax for both types.

Using llist_corec, we can define lmap corecursively as follows:

$$\mathit{lmap}\ f\ xs = \mathit{llist\_corec}\ xs\ (\mathit{map\_head}\ f)$$

where

$$
\begin{aligned}
\mathit{map\_head} &:: (\alpha \to \beta) \to \alpha\ \mathit{llist} \to (\mathit{unit} + (\beta \times \alpha\ \mathit{llist})) \\
\mathit{map\_head}\ f\ xs &= \mathbf{case}\ xs\ \mathbf{of} \\
&\qquad [\,] \Rightarrow \mathit{Inl}\ () \\
&\quad \mid (x \# xs') \Rightarrow \mathit{Inr}\ (f\ x,\ xs')
\end{aligned}
$$

One can then prove by coinduction that this definition satisfies lmap’s recursive
equations. Needless to say, this is not the most intuitive specification of lmap, and most
people would prefer to specify such functions using recursion, if possible. In the
remainder of the paper we will present a framework for defining functions such as lmap
recursively.
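
The corecursive construction can be mirrored in Haskell, whose built-in lists are lazy and so can stand in for both finite and infinite lists. This is a sketch by analogy rather than Isabelle's definition: Either () (a, b) plays the role of the sum type unit + (α × β), and the names llistCorec, mapHead, and lmap are chosen here to follow the text.

```haskell
-- Build a possibly infinite list from a seed b and a step function g.
-- Left () ends the list; Right (x, b') emits x and continues from b'.
llistCorec :: b -> (b -> Either () (a, b)) -> [a]
llistCorec b g = case g b of
  Left ()       -> []
  Right (x, b') -> x : llistCorec b' g

-- The step function used to express lmap: inspect the head of the input.
mapHead :: (a -> b) -> [a] -> Either () (b, [a])
mapHead _ []        = Left ()
mapHead f (x : xs') = Right (f x, xs')

-- lmap defined through the corecursion operator, as in the text.
lmap :: (a -> b) -> [a] -> [b]
lmap f xs = llistCorec xs (mapHead f)
```

For example, take 3 (lmap (+ 1) [1 ..]) yields [2, 3, 4] without evaluating the rest of the infinite input. The standard function Data.List.unfoldr has the same shape as llistCorec, with Maybe in place of Either ().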

3  Solving  recursive  equations 

The  basic  steps  required  in  this  framework  to  show  that  a  set  of  recursive  equations 
is  well  defined  are  as  follows: 

-  Construct  a  single  function  F  that  characterizes  the  set  of  recursive  equations. 

-  Show  that  for  any  two  different  potential  solutions  supplied  to  F,  F  maps  them 
to  two  potential  solutions  that  are  closer  together,  in  a  suitable  sense. 

-  Invoke  the  main  result  (Sect.  4.3)  to  show  that  the  above  property  of  F  is  suffi¬ 
cient  to  guarantee  that  there  is  a  unique  solution  to  the  original  set  of  recursive 
equations. 

In  this  section  we  deal  with  the  first  step. 

3.1  Unique  fixed  points 

We convert a system of pattern matching recursive equations into a functional form
by employing a standard technique from domain theory [5, 17]. We start by recasting
the equations as a single recursive equation using argument destructors or nested
case-expressions. For example, the recursive equations defining the lmap function are
equivalent to the following single recursive equation:

$$
\begin{aligned}
\mathit{lmap}\ f\ l = \mathbf{case}\ l\ \mathbf{of}\ & [\,] \Rightarrow [\,] \\
\mid\ & (x \# xs) \Rightarrow (f\ x) \mathbin{\#} (\mathit{lmap}\ f\ xs)
\end{aligned}
$$

Given $f$, we can reify this pattern of recursion into a non-recursive function $F$ of
type $(\alpha\ \mathit{llist} \to \beta\ \mathit{llist}) \to (\alpha\ \mathit{llist} \to \beta\ \mathit{llist})$ that takes a functional
parameter $\mathit{lmap\_f}$:

$$
\begin{aligned}
F\ \mathit{lmap\_f} = \lambda l.\ \mathbf{case}\ l\ \mathbf{of}\ & [\,] \Rightarrow [\,] \\
\mid\ & (x \# xs) \Rightarrow (f\ x) \mathbin{\#} (\mathit{lmap\_f}\ xs)
\end{aligned}
$$

Using the recursive equations for lmap, it is easy to show that $\mathit{lmap}\ f = F\ (\mathit{lmap}\ f)$.
The value $\mathit{lmap}\ f$ is called a fixed point of $F$. In general, an element $x$ of type $\alpha$ is
a fixed point of a function $g$ of type $\alpha \to \alpha$ if $x = g\ x$. A function may have many
fixed points, or none at all. Considering $g$ as a functional representation of a system
of recursive equations, each fixed point of $g$ represents a valid solution to the system.
If the function $g$ has exactly one fixed point $x$, then we can think of $g$ as defining the
value $x$, in a way that will be made precise shortly.


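Definition 1  A function $f$ of type $\alpha \to \alpha$ has a unique fixed point element $x$ of type
$\alpha$ if $x = f\ x$ and

$$\forall y\ z.\ (y = f\ y) \land (z = f\ z) \longrightarrow y = z.$$

We formalize this definition into a predicate of higher order logic:

$$
\begin{aligned}
\mathit{isUniqFix} &:: \alpha \to (\alpha \to \alpha) \to \mathit{bool} \\
\mathit{isUniqFix}\ x\ f &= (x = f\ x) \land (\forall y\ z.\ (y = f\ y) \land (z = f\ z) \longrightarrow y = z)
\end{aligned}
$$

To define elements using unique fixed points, we rely on Hilbert’s description operator
$(\varepsilon)$:

$$
\begin{aligned}
\mathit{fix} &:: (\alpha \to \alpha) \to \alpha \\
\mathit{fix}\ f &= \varepsilon x.\ \mathit{isUniqFix}\ x\ f
\end{aligned}
$$

The expression $\mathit{fix}\ f$ represents the unique fixed point of $f$, when one exists. The
following lemma captures this fact:

Lemma 1  If there exists an $x$ such that $\mathit{isUniqFix}\ x\ f$ holds, then

$$x = \mathit{fix}\ f = f\ (\mathit{fix}\ f)$$

If $f$ does not have a unique fixed point, then $\mathit{fix}\ f$ denotes an arbitrary value.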
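
The passage from recursive equations to a fixed point of a non-recursive functional can be tried out in Haskell. Note the assumption: Haskell's Data.Function.fix computes a least fixed point by lazy evaluation, which is a different notion from the unique fixed point selected by the description operator in HOL. The sketch only illustrates the shape of $F$ and the equation $\mathit{lmap}\ f = F\ (\mathit{lmap}\ f)$; the name bigF is introduced here because Haskell identifiers cannot start with an upper-case letter.

```haskell
import Data.Function (fix)

-- The non-recursive functional F: the recursive call is replaced by a
-- call to the functional parameter lmapF.
bigF :: (a -> b) -> ([a] -> [b]) -> ([a] -> [b])
bigF f lmapF = \l -> case l of
  []     -> []
  x : xs -> f x : lmapF xs

-- lmap f obtained as a fixed point of F.
lmap :: (a -> b) -> [a] -> [b]
lmap f = fix (bigF f)
```

Applying bigF f to lmap f gives back a function that agrees with lmap f, which is the fixed-point property in executable form; for instance, take 3 (lmap (+ 1) [1 ..]) evaluates to [2, 3, 4].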

3.2  Properties  of  unique  fixed  points 

As an aside, several nice properties hold when one can establish that a system of
recursive equations has a unique solution. For example, unique fixed points can sometimes
“absorb” functions applied to other fixed points.

Lemma 2  Given functions $F : \alpha \to \alpha$, $G : \beta \to \beta$, $f : \alpha \to \beta$, and value $x : \alpha$, such
that $x$ is a (not necessarily unique) fixed point of $F$, $G$ has a unique fixed point, and
$f \circ F = G \circ f$, then $f\ x = \mathit{fix}\ G$.

Proof  We have $f\ x = f\ (F\ x) = (f \circ F)\ x = (G \circ f)\ x = G\ (f\ x)$. Thus the value $f\ x$
is a fixed point of $G$. But since $\mathit{fix}\ G$ is the unique fixed point of $G$, then $f\ x = \mathit{fix}\ G$. qed

Unique fixed points can also be “rotated”, in the following sense:

Lemma 3  If the composition of two functions $g : \beta \to \alpha$ and $h : \alpha \to \beta$ has a
unique fixed point $\mathit{fix}\ (g \circ h)$, then $h \circ g$ also has a unique fixed point, and
$\mathit{fix}\ (g \circ h) = g\ (\mathit{fix}\ (h \circ g))$.

Proof  Let $b = \mathit{fix}\ (g \circ h)$. We first note that $h\ b = h\ ((g \circ h)\ b) = (h \circ g)\ (h\ b)$. Thus
$h\ b$ is a fixed point of $h \circ g$. Next we show that this fixed point is unique by showing
that any two fixed points of $h \circ g$ are equal.

Suppose $x$ and $y$ are fixed points of $h \circ g$. Then $g\ x = g\ ((h \circ g)\ x) = (g \circ h)\ (g\ x)$.
Thus $g\ x$ is a fixed point of $g \circ h$. But since $g \circ h$ has a unique fixed point, then
$g\ x = \mathit{fix}\ (g \circ h)$. Similarly, $g\ y = \mathit{fix}\ (g \circ h)$, and so $g\ x = g\ y$. Applying $h$ to both sides
of this equality, we obtain $h\ (g\ x) = h\ (g\ y)$, which is the same as $(h \circ g)\ x = (h \circ g)\ y$.
Since both $x$ and $y$ are fixed points of $h \circ g$, we have $x = y$.

We can now apply Lemma 2, setting $F = h \circ g$, $G = g \circ h$, $f = g$, and $x = \mathit{fix}\ (h \circ g)$,
to conclude that $g\ (\mathit{fix}\ (h \circ g)) = \mathit{fix}\ (g \circ h)$. qed

Although  we  will  not  use  Lemma  2  or  Lemma  3  in  the  remainder  of  the  paper, 
lemmas  such  as  these  are  useful  for  manipulating  systems  of  recursive  equations  as 
objects  in  their  own  right. 
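
The rotation property of Lemma 3 can be observed on a small example over lazy lists in Haskell. The functions g and h below are chosen purely for illustration, and Haskell's fix is a least fixed point rather than the HOL operator, so comparing finite prefixes is a sanity check of the equation, not a proof.

```haskell
import Data.Function (fix)

-- g prepends 1; h doubles every element. Both act on infinite lists.
g :: [Integer] -> [Integer]
g = (1 :)

h :: [Integer] -> [Integer]
h = map (* 2)

-- Left-hand side of the rotation equation: the solution of x = 1 # map (*2) x.
rotLhs :: [Integer]
rotLhs = fix (g . h)

-- Right-hand side: solve y = map (*2) (1 # y) first, then apply g.
rotRhs :: [Integer]
rotRhs = g (fix (h . g))
```

Both lists begin 1, 2, 4, 8, 16, the powers of two, so take n rotLhs == take n rotRhs for every prefix length n tried.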


4  Converging  equivalence  relations  and  contracting  functions 


While  unique  fixed  points  are  a  useful  definition  mechanism,  it  can  be  difficult  to 
show  that  they  exist  for  a  given  function.  A  direct  proof  usually  involves  constructing 
an  explicit  fixed  point  witness  using  other  definition  techniques,  such  as  corecursion 
or  well-founded  recursion.  Little  effort  seems  to  be  saved. 

We propose an alternative proof technique, based on concepts from domain
theory [5, 17] and topology [1, 13], where one builds a collection of ever-closer
approximations to the desired fixed point, and shows that the limit of these approximations
exists, is a fixed point of the function under consideration, and is unique. The approximation
process can be parameterized to some extent, and reused across multiple definitions
that are “similar” enough. Furthermore these parameterized approximations can be
composed hierarchically, yielding more powerful approximation techniques.


4.1  Converging  equivalence  relations 

To make the notion of approximation precise, we need a way of stating how “close”
two potential approximations are to each other. One approach would be to define a
suitable metric space [1] and use the corresponding distance function, which returns
either a rational or real number, given any two elements in the domain of the metric
space. However, proving that a series of approximations converges to a limit point
often requires one to reason about exponentiation and division over a theory of
rationals or reals. An alternative way to measure “closeness”, which we call a converging
equivalence relation (CER), instead only involves reasoning about well-founded sets,
such as the set of natural numbers, or the set of finite lists. In many cases we can
prove a unique fixed point exists by performing a simple induction over the natural
numbers, something which all of the current HOL theorem provers support well.

A converging equivalence relation consists of:

- A type $\alpha$, called the resolution space

- A type $\beta$, called the target space

- A well-founded, transitive relation $(<)$ over type $\alpha$, called a resolution ordering

- A three-argument predicate $(\approx)$ of type $(\alpha \to \beta \to \beta \to \mathit{bool})$, called an indexed
  equivalence relation. Given an element $i$ of type $\alpha$, and two elements $x$ and $y$ of
  type $\beta$, we denote the application of $(\approx)$ to $i$, $x$ and $y$ as $(x \stackrel{i}{\approx} y)$, and if this value
  is true, then we say that $x$ and $y$ are equivalent at resolution $i$.

The resolution ordering $(<)$ and indexed equivalence relation $(\approx)$ must satisfy the
properties in Fig. 1, for arbitrary $i, i' : \alpha$; $x, y, z : \beta$; and $f : \alpha \to \beta$. Axioms (1), (2),
and (3) state that $(\approx)$ must be an equivalence relation at each resolution $i$. Axiom (4)
states that if a resolution $i$ has no lower resolutions, then $(\approx)$ treats all target elements
as equivalent at that resolution. Such resolutions are called minimal. There is always at
least one minimal resolution (and perhaps more than one), since $(<)$ is well-founded.
Axiom (5) states that if two elements are equivalent at a particular resolution, then
they are equivalent at all lower resolutions. Thus higher resolutions impose
finer-grained, but compatible, partitions of the target space than lower resolutions do.
Although no particular resolution may distinguish all elements, Axiom (6) states that if two
elements are equivalent at all resolutions, then they are in fact equal.

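Axioms (7) and (8) deal with “limits” of approximations. First some terminology:
a function $f : \alpha \to \beta$ from the space of resolutions to the target space of elements is
called an approximation map. An approximation map $f$ is convergent up to resolution
$i$ if for all resolutions $j$ and $j'$ such that $j < j' < i$, $(f\ j)$ is equivalent at
resolution $j$ to $(f\ j')$. Note that it is possible for $(f\ i)$ itself not to be equivalent to
any of the lower-resolution $(f\ j)$’s. An approximation map $f$ is globally convergent if
for all resolutions $j$ and $j'$ such that $j < j'$, $(f\ j) \stackrel{j}{\approx} (f\ j')$.

$$
\begin{aligned}
& x \stackrel{i}{\approx} x && (1) \\
& x \stackrel{i}{\approx} y \longrightarrow y \stackrel{i}{\approx} x && (2) \\
& x \stackrel{i}{\approx} y \land y \stackrel{i}{\approx} z \longrightarrow x \stackrel{i}{\approx} z && (3) \\
& (\forall j.\ \lnot (j < i)) \longrightarrow x \stackrel{i}{\approx} y && (4) \\
& x \stackrel{i}{\approx} y \land i' < i \longrightarrow x \stackrel{i'}{\approx} y && (5) \\
& (\forall i.\ x \stackrel{i}{\approx} y) \longrightarrow x = y && (6) \\
& (\forall j\ j'.\ j < j' < i \longrightarrow (f\ j) \stackrel{j}{\approx} (f\ j')) \longrightarrow (\exists z.\ \forall j < i.\ z \stackrel{j}{\approx} (f\ j)) && (7) \\
& (\forall j\ j'.\ j < j' \longrightarrow (f\ j) \stackrel{j}{\approx} (f\ j')) \longrightarrow (\exists z.\ \forall j.\ z \stackrel{j}{\approx} (f\ j)) && (8)
\end{aligned}
$$

Fig. 1. The CER axioms. Each of these axioms must hold for arbitrary $i$, $x$, $y$, and $f$.

Axiom (7) states that if $f$ is locally convergent up to resolution $i$, then there exists
a limit-like element $z$ that is equivalent at each resolution $j < i$ to the corresponding
$(f\ j)$ approximation. Axiom (8) states that if $f$ is globally convergent, then there
exists a limit element $z$ that is equivalent to each approximation $(f\ j)$ at resolution $j$.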


4.2  Examples  of  converging  equivalence  relations 

Discrete CER The simplest useful CER has as a resolution space a two-element
type containing the values $\bot$ and $\top$, with $(\bot < \top)$, and a target space $\beta$ with $(\approx)$
defined such that $(x \overset{\bot}{\approx} y) = \mathit{True}$, and $(x \overset{\top}{\approx} y) = (x = y)$. Axioms (1) through (6)
are easy to verify. Axiom (7) holds for any element. The limit element satisfying (8)
is $f\,\top$.


Lazy list CER We can construct a converging equivalence relation for comparing
coinductive lists by comparing the first $i$ elements of two lazy lists $l_1$ and $l_2$ at a given
resolution $i$. To perform the comparison, we make use of the ltake function, with type
$\mathit{nat} \to \alpha\ \mathit{llist} \to \alpha\ \mathit{list}$. The expression $(\mathit{ltake}\ n\ xs)$ returns a finite list consisting of the
first $n$ elements of $xs$. If $xs$ has fewer than $n$ elements, then ltake returns the whole
of $xs$. The ltake function can be defined by well-founded recursion on its numeric
argument with the following recursive equations:

\begin{align*}
\mathit{ltake}\ 0\ xs &= [\,] \\
\mathit{ltake}\ (n+1)\ [\,] &= [\,] \\
\mathit{ltake}\ (n+1)\ (x \,\#\, xs) &= x \,\#\, (\mathit{ltake}\ n\ xs)
\end{align*}

We then define the lazy list CER with the natural numbers as the resolution space,
$(\alpha\ \mathit{llist})$ as the target space, the usual ordering on the natural numbers for $(<)$, and
$(\approx)$ defined as follows:

\[ xs \overset{i}{\approx} ys \;=\; (\mathit{ltake}\ i\ xs = \mathit{ltake}\ i\ ys). \]

Axioms (1) through (3) hold trivially. The only minimal resolution in this CER is 0,
and since $(\mathit{ltake}\ 0\ xs) = [\,]$, (4) holds. If two lazy lists are equal up to the first
$i$ positions, then they are equal up to any position $i' < i$, so (5) holds. Axiom (6)
reduces to the Take Lemma[12], which can be proved by coinduction.
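The indexed equivalence of the lazy list CER can be illustrated with a small Python sketch. It is not part of the HOL development: lazy lists are modelled here as Python iterators, and the names `equiv_at`, `naturals`, and `capped` are hypothetical helpers introduced only for this illustration.

```python
from itertools import count, islice

def ltake(n, xs):
    """Return the first n elements of the iterable xs as a finite list."""
    return list(islice(xs, n))

def equiv_at(i, make_xs, make_ys):
    """Indexed equivalence of the lazy list CER at resolution i.

    The arguments are zero-argument functions returning fresh iterators,
    because Python iterators are consumed when they are read.
    """
    return ltake(i, make_xs()) == ltake(i, make_ys())

def naturals():
    return count(0)                          # 0, 1, 2, 3, 4, ...

def capped():
    return (min(n, 3) for n in count(0))     # 0, 1, 2, 3, 3, ...

# Resolution 0 is minimal, so everything is equivalent there (axiom 4).
at_zero = equiv_at(0, naturals, capped)
# The two lists agree on their first four elements but not on the fifth.
at_four = equiv_at(4, naturals, capped)
at_five = equiv_at(5, naturals, capped)
```

In this sketch the two infinite lists are equivalent at resolutions 0 through 4 and distinguished from resolution 5 onward, matching the reading of higher resolutions as finer partitions.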


Axioms (7) and (8) require us to construct appropriate limit elements, given an
approximation map. Both limit elements can be constructed by a single function,
which we call llist_diag. For a given approximation map $f$, the limit elements may be
of infinite length, so we define llist_diag by corecursion, using llist_corec:

\[ \mathit{llist\_diag}\ f = \mathit{llist\_corec}\ 0\ (\mathit{nthElem}\ f) \]
where
\[ \mathit{nthElem}\ f\ n = \begin{cases} \mathit{Inl}\ () & \text{if } \mathit{ldrop}\ n\ (f\,(n+1)) = [\,] \\ \mathit{Inr}\ (x, n+1) & \text{if } \mathit{ldrop}\ n\ (f\,(n+1)) = (x \,\#\, xs) \end{cases} \]

The helper function nthElem uses the ldrop function on lazy lists. The ldrop function
has type $\mathit{nat} \to (\alpha\ \mathit{llist}) \to (\alpha\ \mathit{llist})$, and $(\mathit{ldrop}\ i\ xs)$ removes the first $i$ elements from
$xs$, returning the remainder. Like ltake, it is defined by well-founded recursion on its
numeric argument:

\begin{align*}
\mathit{ldrop}\ 0\ xs &= xs \\
\mathit{ldrop}\ (n+1)\ [\,] &= [\,] \\
\mathit{ldrop}\ (n+1)\ (x \,\#\, xs) &= \mathit{ldrop}\ n\ xs
\end{align*}

The overall action of llist_diag is to construct a so-called diagonal list from the
approximation map $f$, where the $n$th element of the result list is drawn from the
$n$th element of approximation $f\,(n+1)$, if the $n$th element exists. If the $n$th element
does not exist (i.e., the length of $f\,(n+1)$ is at most $n$, counting positions from 0), then
the result list is terminated at that point. This process is shown in Fig. 2. There are two possible
cases. In Fig. 2-a, we see that the approximation map $f$ converges to the finite list
$[x_0, x_1, x_2, x_3, x_4]$. In Fig. 2-b, the approximation map $f$ is converging to the infinite
list $[x_0, x_1, x_2, x_3, x_4, x_5, x_6, \ldots]$.


Fig. 2. The llist_diag function constructs a limit list from an approximation mapping. In (a)
the approximation mapping converges to a finite list; in (b) to an infinite list.


It turns out that for any CER whose $(<)$ relation is the less-than ordering on the
natural numbers, the following property implies both (7) and (8):

\[ \forall f .\ (\forall i .\ (f\,i) \overset{i}{\approx} (f\,(i+1))) \longrightarrow (\exists x .\ \forall i .\ x \overset{i}{\approx} (f\,i)). \]

With some work, one can show that this property holds for the lazy list CER by
supplying $\mathit{llist\_diag}\ f$ as the existential witness element for $x$.
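The diagonal construction can be sketched in Python as a generator. This is only an illustration under stated assumptions: an approximation map is modelled as a hypothetical Python function from a natural number to an iterable, and the corecursion via llist_corec is replaced by an ordinary generator loop.

```python
from itertools import islice

def ldrop(n, xs):
    """Skip the first n elements of the iterable xs."""
    return islice(xs, n, None)

def llist_diag(f):
    """Yield element n of the approximation f(n + 1), for n = 0, 1, 2, ...

    The result ends as soon as f(n + 1) has no element at position n.
    """
    n = 0
    while True:
        rest = list(islice(ldrop(n, f(n + 1)), 1))
        if not rest:          # corresponds to the Inl () branch of nthElem
            return
        yield rest[0]         # corresponds to the Inr (x, n + 1) branch
        n += 1

# An approximation map converging to the infinite list 0, 1, 2, ...
def growing(i):
    return range(i)

# An approximation map converging to the finite list [0, 1, 2].
def bounded(i):
    return range(min(i, 3))

infinite_prefix = list(islice(llist_diag(growing), 5))
finite_limit = list(llist_diag(bounded))
```

The first map corresponds to case (b) of Fig. 2 and the second to case (a): the generator either runs forever or stops at the first missing diagonal position.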

4.3  Contracting  functions 

In the theory of metric spaces, a contracting function is a function $F$ such that for
any two points $x$ and $y$, $F x$ is closer to $F y$ than $x$ is to $y$, given a suitable distance
function. Banach’s theorem states that all contracting functions over suitable metric
spaces have unique fixed points. We can define an analogous notion over a CER:


Definition 2 A function $F$ is contracting over a CER given by $(<)$ and $(\approx)$ if for
all resolutions $i$ and target elements $x$ and $y$,

\[ (\forall i' < i .\ x \overset{i'}{\approx} y) \longrightarrow (F\,x) \overset{i}{\approx} (F\,y). \]

Intuitively a function is contracting if, given two elements $x$ and $y$ that are close
enough together at all lower resolutions $i' < i$ to satisfy the CER, but are potentially
too far away at resolution $i$, then $F$ maps them to two elements that are now close
enough at resolution $i$.

For example, the function $\mathit{consZero}\ xs = (0 \,\#\, xs)$ is contracting over the lazy list
CER, since given any $i$ and two lazy lists $xs$ and $ys$,

\[ (\forall i' < i .\ \mathit{ltake}\ i'\ xs = \mathit{ltake}\ i'\ ys) \longrightarrow \mathit{ltake}\ i\ (\mathit{consZero}\ xs) = \mathit{ltake}\ i\ (\mathit{consZero}\ ys). \]
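As a concrete check of the contraction property for consZero, the following Python sketch tests one instance of the implication on two infinite lists. The lists `xs` and `ys` and the helper `agree_below` are hypothetical examples chosen for illustration; a finite test like this only samples the property and does not prove it.

```python
from itertools import chain, count, islice, repeat

def ltake(n, xs):
    """Return the first n elements of the iterable xs as a finite list."""
    return list(islice(xs, n))

def cons_zero(xs):
    """consZero xs = 0 # xs, on iterators."""
    return chain([0], xs)

def xs():
    return count(0)                        # 0, 1, 2, 3, 4, ...

def ys():
    return chain([0, 1, 2], repeat(9))     # 0, 1, 2, 9, 9, ...

def agree_below(i, make_a, make_b):
    """True if the two lists are equivalent at every resolution below i."""
    return all(ltake(j, make_a()) == ltake(j, make_b()) for j in range(i))

i = 4
hypothesis = agree_below(i, xs, ys)                  # agree at resolutions 0..3
differ_at_i = ltake(i, xs()) != ltake(i, ys())       # but differ at resolution 4
conclusion = ltake(i, cons_zero(xs())) == ltake(i, cons_zero(ys()))
```

Here the inputs are distinguishable at resolution 4, yet their images under `cons_zero` are not, because prepending an element shifts the first disagreement one position further out.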

The  main  result  of  this  paper  is  as  follows: 

Theorem  A  contracting  function  F  over  a  CER  has  a  unique  fixed  point. 

The  proof  is  discussed  in  Sect.  7.  For  now,  we  would  like  to  apply  this  theorem 
to  define  some  simple  recursive  functions  over  lazy  lists. 

4.4  Recursive  definitions  over  coinductive  lists 

To begin with, we can simplify the definition of a contracting function $F$ over a CER
when the $(<)$ relation of that CER is the less-than relation over the natural numbers.
In this case, Definition 2 reduces to

\[ \forall i\, x\, y .\ x \overset{i}{\approx} y \longrightarrow (F\,x) \overset{i+1}{\approx} (F\,y). \tag{9} \]

Specializing this formula for the lazy list CER, we have that $F$ is contracting on lazy
lists if

\[ \forall i\, x\, y .\ \mathit{ltake}\ i\ x = \mathit{ltake}\ i\ y \longrightarrow \mathit{ltake}\ (i+1)\ (F\,x) = \mathit{ltake}\ (i+1)\ (F\,y). \tag{10} \]

Defining iterates Let us establish that the following recursive equation, defined
over $x$ and $f$, has a unique solution, and thus a definition:

\[ \mathit{iterates} = (x \,\#\, (\mathit{lmap}\ f\ \mathit{iterates})) \tag{11} \]

This equation builds the infinite list $[x, f\,x, f\,(f\,x), \ldots]$. We first define the
non-recursive function $F$ that characterizes this equation:

\[ F\ \mathit{iterates}' = (x \,\#\, (\mathit{lmap}\ f\ \mathit{iterates}')) \]

and then show that it is a contracting function. To do this we rely on (10), and
assume we have two arbitrary lazy lists $xs$ and $ys$ such that $\mathit{ltake}\ i\ xs = \mathit{ltake}\ i\ ys$. We
now need to show that $\mathit{ltake}\ (i+1)\ (F\,xs) = \mathit{ltake}\ (i+1)\ (F\,ys)$. Using a process of
equational simplification we are able to reduce the goal to the assumption, as follows:

\begin{align*}
& \mathit{ltake}\ (i+1)\ (F\,xs) = \mathit{ltake}\ (i+1)\ (F\,ys) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (x \,\#\, (\mathit{lmap}\ f\ xs)) = \mathit{ltake}\ (i+1)\ (x \,\#\, (\mathit{lmap}\ f\ ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ i\ (\mathit{lmap}\ f\ xs) = \mathit{ltake}\ i\ (\mathit{lmap}\ f\ ys) \\
\Leftarrow\ & \mathit{ltake}\ i\ xs = \mathit{ltake}\ i\ ys
\end{align*}

The simplification relies on the following facts, each proved by induction on $i$:

\begin{align*}
& (\mathit{ltake}\ (i+1)\ (z \,\#\, xs) = \mathit{ltake}\ (i+1)\ (z \,\#\, ys)) \Leftrightarrow (\mathit{ltake}\ i\ xs = \mathit{ltake}\ i\ ys) \\
& (\mathit{ltake}\ i\ (\mathit{lmap}\ f\ xs) = \mathit{ltake}\ i\ (\mathit{lmap}\ f\ ys)) \Leftarrow (\mathit{ltake}\ i\ xs = \mathit{ltake}\ i\ ys)
\end{align*}

These facts illustrate a nice property of this proof: We did not have to expand the
definitions of $(\#)$ or lmap during the simplification process, relying instead on an
abstract characterization of their behavior with respect to ltake. This turns out to
be the case for many functions, even recursive ones defined by contracting functions.
In general we can often incrementally define recursive functions and prove properties
about how they behave with respect to $(\approx)$, without having to expand the definitions
of functions making up the body of the recursive definition.
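The effect of the contraction property on the iterates equation can be observed with a Python sketch: applying the non-recursive function F repeatedly to any starting list fixes one more leading element per application. The function `iterates_prefix` and its `seed` parameter are hypothetical names for this illustration, and iterators stand in for lazy lists.

```python
from itertools import chain, islice, repeat

def F(x, f, approx):
    """One unfolding of equation (11): x # (lmap f approx)."""
    return chain([x], map(f, approx))

def iterates_prefix(x, f, n, seed):
    """Apply F n times to an arbitrary seed list and read off n elements."""
    approx = iter(seed)
    for _ in range(n):
        approx = F(x, f, approx)
    return list(islice(approx, n))

def double(v):
    return 2 * v

# Two very different seeds give the same first five elements.
from_empty = iterates_prefix(1, double, 5, [])
from_noise = iterates_prefix(1, double, 5, repeat(-7))
```

Both results are the first five elements of [x, f x, f (f x), ...] for x = 1 and f doubling, which is what uniqueness of the fixed point leads one to expect: the choice of seed cannot influence any finite prefix once F has been applied often enough.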

5  Composing  converging  equivalence  relations 

The lazy list CER allows us to give recursive definitions of individual lazy lists, but
we are often more interested in recursively defining functions that transform lazy
lists. Fortunately, there are several CER combinators that allow us to build CERs
over complex types, if we have CERs that operate on the corresponding atomic
types.

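Local and global limits When constructing a new CER $C'$ out of an existing
CER $C$, we usually have to show (7) and (8) hold for $C'$ by invoking (7) and (8) for
$C$, to create the necessary limit witness elements. To make this process explicit, we
use Hilbert’s description operator $(\varepsilon)$ to create functions that return these witness
elements, given an appropriate approximation mapping $f$:

\begin{align*}
& \mathit{local\_limit} :: (\alpha \to \beta) \to \alpha \to \beta \\
& \mathit{local\_limit}\ f\ i = (\varepsilon z .\ \forall j < i .\ z \overset{j}{\approx} (f\,j)) \tag{12} \\
& \mathit{global\_limit} :: (\alpha \to \beta) \to \beta \\
& \mathit{global\_limit}\ f = (\varepsilon z .\ \forall j .\ z \overset{j}{\approx} (f\,j)) \tag{13}
\end{align*}

We can use (7) and (8) to prove the basic properties we want local_limit and global_limit
to have for any CER given by $(<)$ and $(\approx)$:

\begin{align*}
& (\forall j, j' .\ j < j' < i \longrightarrow (f\,j) \overset{j}{\approx} (f\,j')) \longrightarrow (\forall j < i .\ (\mathit{local\_limit}\ f\ i) \overset{j}{\approx} (f\,j)) \\
& (\forall j, j' .\ j < j' \longrightarrow (f\,j) \overset{j}{\approx} (f\,j')) \longrightarrow (\forall j .\ (\mathit{global\_limit}\ f) \overset{j}{\approx} (f\,j))
\end{align*}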

Function-space CER The functions local_limit and global_limit allow us to
concisely specify the limit elements of CER combinators. For example, given a CER $C$
from resolution space $\alpha$ to target space $\beta$ given by $(<)$ and $(\approx)$, we can construct a
new function-space over $C$ CER with the same resolution ordering $(<)$, and a new
indexed equivalence relation $(\approx')$ with type
$\alpha \to (\tau \to \beta) \to (\tau \to \beta) \to \mathit{bool}$, defined as

\[ g \overset{i}{\approx'} h \;=\; \forall x .\ (g\,x) \overset{i}{\approx} (h\,x). \]

The limit elements satisfying (7) and (8) can be given as

\begin{align*}
\mathit{local\_limit\_fun}\ f\ i &= (\lambda x .\ \mathit{local\_limit}\ (\lambda i .\ f\,i\,x)\ i) \\
\mathit{global\_limit\_fun}\ f &= (\lambda x .\ \mathit{global\_limit}\ (\lambda i .\ f\,i\,x))
\end{align*}

Given these limit-producing functions, it is relatively easy to show that the
function-space over $C$ CER satisfies the CER axioms.


5.1  Defining  recursive  functions  with  the  function-space  CER 

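Defining lmap We can apply the function-space CER to define lmap recursively.
The recursion equations for lmap are:

\begin{align*}
\mathit{lmap}\ f\ [\,] &= [\,] \\
\mathit{lmap}\ f\ (x \,\#\, xs) &= (f\,x) \,\#\, (\mathit{lmap}\ f\ xs)
\end{align*}

We translate the equations into a non-recursive form (parameterized over $f$):

\[ F\ \mathit{lmap}' = (\lambda xs .\ \mathbf{case}\ xs\ \mathbf{of}\ [\,] \Rightarrow [\,] \ \mid\ (y \,\#\, ys) \Rightarrow (f\,y) \,\#\, (\mathit{lmap}'\ ys)). \]

We then need to show that $\mathit{fix}\ F$ is the unique fixed point of $F$ by proving that $F$ is a
contracting function on the function-space over lazy lists CER. By (9) we must show
(for arbitrary resolution $i$ and functions $g$ and $h$) that $(g \overset{i}{\approx'} h \longrightarrow (F\,g) \overset{i+1}{\approx'} (F\,h))$.
Expanding definitions, we obtain

\begin{align*}
& g \overset{i}{\approx'} h \longrightarrow (F\,g) \overset{i+1}{\approx'} (F\,h) \\
\Leftrightarrow\ & (\forall xs .\ g\,xs \overset{i}{\approx} h\,xs) \longrightarrow (\forall xs .\ (F\,g\,xs) \overset{i+1}{\approx} (F\,h\,xs)) \\
\Leftrightarrow\ & (\forall xs .\ \mathit{ltake}\ i\ (g\,xs) = \mathit{ltake}\ i\ (h\,xs)) \longrightarrow \\
& (\forall xs .\ \mathit{ltake}\ (i+1)\ (F\,g\,xs) = \mathit{ltake}\ (i+1)\ (F\,h\,xs)).
\end{align*}

So, to prove $F$ is contracting we take an arbitrary resolution $i$ and two arbitrarily
chosen functions $g$ and $h$ such that $(\forall xs .\ \mathit{ltake}\ i\ (g\,xs) = \mathit{ltake}\ i\ (h\,xs))$, and show for
an arbitrary $xs$ that $\mathit{ltake}\ (i+1)\ (F\,g\,xs) = \mathit{ltake}\ (i+1)\ (F\,h\,xs)$. There are two cases
to consider:

case $xs = [\,]$:

\begin{align*}
& \mathit{ltake}\ (i+1)\ (F\,g\,xs) = \mathit{ltake}\ (i+1)\ (F\,h\,xs) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (F\,g\,[\,]) = \mathit{ltake}\ (i+1)\ (F\,h\,[\,]) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (\mathbf{case}\ [\,]\ \mathbf{of}\ [\,] \Rightarrow [\,] \ \mid\ (y \,\#\, ys) \Rightarrow (f\,y) \,\#\, (g\,ys)) = \\
& \mathit{ltake}\ (i+1)\ (\mathbf{case}\ [\,]\ \mathbf{of}\ [\,] \Rightarrow [\,] \ \mid\ (y \,\#\, ys) \Rightarrow (f\,y) \,\#\, (h\,ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ [\,] = \mathit{ltake}\ (i+1)\ [\,] \\
\Leftrightarrow\ & \mathit{True}.
\end{align*}

case $xs = (y \,\#\, ys)$:

\begin{align*}
& \mathit{ltake}\ (i+1)\ (F\,g\,xs) = \mathit{ltake}\ (i+1)\ (F\,h\,xs) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (F\,g\,(y \,\#\, ys)) = \mathit{ltake}\ (i+1)\ (F\,h\,(y \,\#\, ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (\mathbf{case}\ (y \,\#\, ys)\ \mathbf{of}\ [\,] \Rightarrow [\,] \ \mid\ (y \,\#\, ys) \Rightarrow (f\,y) \,\#\, (g\,ys)) = \\
& \mathit{ltake}\ (i+1)\ (\mathbf{case}\ (y \,\#\, ys)\ \mathbf{of}\ [\,] \Rightarrow [\,] \ \mid\ (y \,\#\, ys) \Rightarrow (f\,y) \,\#\, (h\,ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ ((f\,y) \,\#\, (g\,ys)) = \mathit{ltake}\ (i+1)\ ((f\,y) \,\#\, (h\,ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ i\ (g\,ys) = \mathit{ltake}\ i\ (h\,ys) \\
\Leftrightarrow\ & \mathit{True}\ \{\text{by assumption}\}.
\end{align*}

Given the definition of $F$ and basic lemmas about ltake, Isabelle’s high-level
simplification tactics allow the above proof to be carried out in two steps. The proof
completes in about a second on a 266MHz Pentium II.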
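The function-space version of the argument can be mirrored in Python on finite lists. The sketch below is an illustration only: `F` takes the mapped function `f` as an explicit parameter, `approx_lmap` is a hypothetical helper that applies F a fixed number of times, and the two seed functions are deliberately wrong starting points.

```python
def F(f, g):
    """Non-recursive body of lmap: the recursive call is replaced by g."""
    def step(xs):
        if not xs:
            return []
        return [f(xs[0])] + g(xs[1:])
    return step

def approx_lmap(f, k, seed):
    """Apply F k times to an arbitrary seed function."""
    g = seed
    for _ in range(k):
        g = F(f, g)
    return g

def add_ten(v):
    return v + 10

def seed_a(xs):
    return ["?"]          # arbitrary, wrong answer

def seed_b(xs):
    return []             # a different arbitrary answer

data = [1, 2, 3, 4, 5]
out_a = approx_lmap(add_ten, 3, seed_a)(data)
out_b = approx_lmap(add_ten, 3, seed_b)(data)
```

After three applications the two results agree on their first three elements regardless of the seed, which is the pointwise reading of equivalence at resolution 3 in the function-space CER.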


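Defining lappend We can apply the function-space CER combinator repeatedly, to
prove that multi-argument curried functions have unique fixed points. As a concrete
example, the curried function lappend has type $\alpha\ \mathit{llist} \to \alpha\ \mathit{llist} \to \alpha\ \mathit{llist}$. It takes
two lazy list arguments $xs$ and $ys$ and returns a new list consisting of the elements of
$xs$ followed by the elements of $ys$. The recursive equations for lappend are

\begin{align*}
\mathit{lappend}\ [\,]\ ys &= ys \\
\mathit{lappend}\ (x \,\#\, xs)\ ys &= (x \,\#\, \mathit{lappend}\ xs\ ys)
\end{align*}

To prove that these equations have a unique solution, we apply the function-space
CER combinator to the lazy list CER to obtain a new CER $C'$. We then apply the
function-space CER combinator again to $C'$, obtaining a new CER $C''$ with the usual
less-than relation on nat for $(<)$ and the following indexed equivalence relation $(\approx'')$:

\[ g \overset{i}{\approx''} h \;=\; (\forall xs\ ys .\ \mathit{ltake}\ i\ (g\,xs\,ys) = \mathit{ltake}\ i\ (h\,xs\,ys)). \]

Next, we convert the recursive equations for lappend into a non-recursive function $F$:

\[ F\ \mathit{lappend}' = (\lambda xs\ ys .\ \mathbf{case}\ xs\ \mathbf{of}\ [\,] \Rightarrow ys \ \mid\ (x \,\#\, xs') \Rightarrow (x \,\#\, (\mathit{lappend}'\ xs'\ ys))). \]

By (9) we must show for arbitrary resolution $i$ and functions $g$ and $h$, that

\begin{align*}
& (\forall xs\ ys .\ \mathit{ltake}\ i\ (g\,xs\,ys) = \mathit{ltake}\ i\ (h\,xs\,ys)) \longrightarrow \\
& (\forall xs\ ys .\ \mathit{ltake}\ (i+1)\ (F\,g\,xs\,ys) = \mathit{ltake}\ (i+1)\ (F\,h\,xs\,ys)).
\end{align*}

So we take arbitrary $i$, $xs$, and $ys$, and prove

\[ \mathit{ltake}\ (i+1)\ (F\,g\,xs\,ys) = \mathit{ltake}\ (i+1)\ (F\,h\,xs\,ys) \]

assuming we have $(\forall xs\ ys .\ \mathit{ltake}\ i\ (g\,xs\,ys) = \mathit{ltake}\ i\ (h\,xs\,ys))$. There are two cases to
consider, depending on whether $xs$ is empty or not:

case $xs = [\,]$:

\begin{align*}
& \mathit{ltake}\ (i+1)\ (F\,g\,xs\,ys) = \mathit{ltake}\ (i+1)\ (F\,h\,xs\,ys) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (F\,g\,[\,]\,ys) = \mathit{ltake}\ (i+1)\ (F\,h\,[\,]\,ys) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (\mathbf{case}\ [\,]\ \mathbf{of}\ [\,] \Rightarrow ys \ \mid\ (x \,\#\, xs') \Rightarrow x \,\#\, (g\,xs'\,ys)) = \\
& \mathit{ltake}\ (i+1)\ (\mathbf{case}\ [\,]\ \mathbf{of}\ [\,] \Rightarrow ys \ \mid\ (x \,\#\, xs') \Rightarrow x \,\#\, (h\,xs'\,ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ ys = \mathit{ltake}\ (i+1)\ ys \\
\Leftrightarrow\ & \mathit{True}.
\end{align*}

case $xs = (x \,\#\, xs')$:

\begin{align*}
& \mathit{ltake}\ (i+1)\ (F\,g\,xs\,ys) = \mathit{ltake}\ (i+1)\ (F\,h\,xs\,ys) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (F\,g\,(x \,\#\, xs')\,ys) = \mathit{ltake}\ (i+1)\ (F\,h\,(x \,\#\, xs')\,ys) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (\mathbf{case}\ (x \,\#\, xs')\ \mathbf{of}\ [\,] \Rightarrow ys \ \mid\ (x \,\#\, xs') \Rightarrow x \,\#\, (g\,xs'\,ys)) = \\
& \mathit{ltake}\ (i+1)\ (\mathbf{case}\ (x \,\#\, xs')\ \mathbf{of}\ [\,] \Rightarrow ys \ \mid\ (x \,\#\, xs') \Rightarrow x \,\#\, (h\,xs'\,ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ (i+1)\ (x \,\#\, (g\,xs'\,ys)) = \mathit{ltake}\ (i+1)\ (x \,\#\, (h\,xs'\,ys)) \\
\Leftrightarrow\ & \mathit{ltake}\ i\ (g\,xs'\,ys) = \mathit{ltake}\ i\ (h\,xs'\,ys) \\
\Leftrightarrow\ & \mathit{True}\ \{\text{by assumption}\}.
\end{align*}

Thus we can conclude that lappend has a unique fixed point definition. We were able
to carry out this proof in Isabelle in three steps, again taking about a second of CPU
time.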


5.2  Other  CER  combinators 

CER combinators can also be defined over product and sum types. The lazy list CER
can be generalized to work over any coinductive type that has a notion of depth, such
as coinductive trees. A more powerful function-space CER is discussed in Sect. 6.


5.3  Demonstrating  equality  between  coinductive  elements 

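Converging equivalence relations can also be useful in showing that two elements of
a target space are equal. Axiom (6) (restated below) says that to show two target
elements $x$ and $y$ are equal, one simply needs to show they are equivalent at all
resolutions $i$:

\[ (\forall i .\ x \overset{i}{\approx} y) \longrightarrow x = y. \]

We can often demonstrate that $x$ and $y$ are equivalent at all resolutions by
well-founded induction, since $(<)$ is a well-founded relation. For example, given two
arbitrary lazy lists $ys$ and $zs$, we can prove the following lemma about lappend by (simple)
induction on $i$:

Lemma 4

\[ \forall xs .\ \mathit{ltake}\ i\ (\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs) = \mathit{ltake}\ i\ (\mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs)). \]


Proof

case $i = 0$:

Take $xs$ to be an arbitrary lazy list. Then

\begin{align*}
& \mathit{ltake}\ i\ (\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs) = \mathit{ltake}\ i\ (\mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs)) \\
\Leftrightarrow\ & \mathit{ltake}\ 0\ (\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs) = \mathit{ltake}\ 0\ (\mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs)) \\
\Leftrightarrow\ & [\,] = [\,] \\
\Leftrightarrow\ & \mathit{True}.
\end{align*}

case $i = (k+1)$:

Induction hypothesis:

Assume $(\forall xs .\ \mathit{ltake}\ k\ (\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs) = \mathit{ltake}\ k\ (\mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs)))$

Take $xs$ to be an arbitrary lazy list. Then

\begin{align*}
& \mathit{ltake}\ i\ (\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs) = \mathit{ltake}\ i\ (\mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs)) \\
\Leftrightarrow\ & \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs) = \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs))
\end{align*}

subcase $xs = [\,]$:

\begin{align*}
\Leftrightarrow\ & \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ (\mathit{lappend}\ [\,]\ ys)\ zs) = \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ [\,]\ (\mathit{lappend}\ ys\ zs)) \\
\Leftrightarrow\ & \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ ys\ zs) = \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ ys\ zs) \\
\Leftrightarrow\ & \mathit{True}.
\end{align*}

subcase $xs = (x \,\#\, xs')$:

\begin{align*}
\Leftrightarrow\ & \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ (\mathit{lappend}\ (x \,\#\, xs')\ ys)\ zs) = \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ (x \,\#\, xs')\ (\mathit{lappend}\ ys\ zs)) \\
\Leftrightarrow\ & \mathit{ltake}\ (k+1)\ (\mathit{lappend}\ (x \,\#\, (\mathit{lappend}\ xs'\ ys))\ zs) = \mathit{ltake}\ (k+1)\ (x \,\#\, (\mathit{lappend}\ xs'\ (\mathit{lappend}\ ys\ zs))) \\
\Leftrightarrow\ & \mathit{ltake}\ (k+1)\ (x \,\#\, (\mathit{lappend}\ (\mathit{lappend}\ xs'\ ys)\ zs)) = \mathit{ltake}\ (k+1)\ (x \,\#\, (\mathit{lappend}\ xs'\ (\mathit{lappend}\ ys\ zs))) \\
\Leftrightarrow\ & \mathit{ltake}\ k\ (\mathit{lappend}\ (\mathit{lappend}\ xs'\ ys)\ zs) = \mathit{ltake}\ k\ (\mathit{lappend}\ xs'\ (\mathit{lappend}\ ys\ zs)) \\
\Leftrightarrow\ & \mathit{True}\ \{\text{by induction hypothesis}\}.
\end{align*}

This proof took four steps in Isabelle, and relied on the following facts about lappend,
each proved in two steps by expanding lappend's recursive definition once and
simplifying:

\begin{align*}
\mathit{lappend}\ [\,]\ ys &= ys \\
\mathit{lappend}\ (x \,\#\, xs)\ ys &= x \,\#\, (\mathit{lappend}\ xs\ ys)
\end{align*}

Given Lemma 4 and CER axiom (6) instantiated to the lazy list CER, we can then easily
show in one Isabelle step that $\mathit{lappend}\ (\mathit{lappend}\ xs\ ys)\ zs = \mathit{lappend}\ xs\ (\mathit{lappend}\ ys\ zs)$.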
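Lemma 4 compares the two sides of the associativity law one resolution at a time, which also works when the lists are infinite. The Python sketch below samples this comparison for a few resolutions; the generator `lappend` and the particular lists are assumptions made for illustration, and the check is a finite test rather than a proof.

```python
from itertools import count, islice, repeat

def ltake(n, xs):
    """Return the first n elements of the iterable xs as a finite list."""
    return list(islice(xs, n))

def lappend(xs, ys):
    """Lazily yield the elements of xs, then those of ys."""
    yield from xs
    yield from ys

def left():
    # lappend (lappend xs ys) zs with xs finite, ys and zs infinite
    return lappend(lappend(iter([1, 2]), count(10)), repeat(0))

def right():
    # lappend xs (lappend ys zs) with the same three lists
    return lappend(iter([1, 2]), lappend(count(10), repeat(0)))

# Equivalence at resolutions 0..7, the sampled form of Lemma 4.
agree = all(ltake(i, left()) == ltake(i, right()) for i in range(8))
prefix = ltake(4, left())
```

Neither side can ever be compared in full, since both are infinite, but every finite resolution can be checked, which is the situation axiom (6) is designed for.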

6  Defining  functions  with  unbounded  look-ahead 

The functions we have defined so far examine their arguments by performing at most
one pattern match on a lazy list before producing an element of a result list. However,
there is a class of functions that can examine a potentially infinite amount of their
argument lists before deciding the next element to output. An example is the lazy
filter function of type $(\alpha \to \mathit{bool}) \to \alpha\ \mathit{llist} \to \alpha\ \mathit{llist}$, which takes a predicate $P$ and a
lazy list $xs$, and returns a lazy list of the same type consisting only of those elements
of $xs$ satisfying $P$. A candidate set of recursion equations for this function might be

\begin{align*}
\mathit{lfilter}\ P\ [\,] &= [\,] \\
\mathit{lfilter}\ P\ (x \,\#\, xs) &= \mathit{lfilter}\ P\ xs, && \text{if } \lnot (P\,x) \\
\mathit{lfilter}\ P\ (x \,\#\, xs) &= x \,\#\, (\mathit{lfilter}\ P\ xs), && \text{if } P\,x
\end{align*}

Sadly, this intuitively appealing set of equations does not completely define lfilter. If
lfilter is given an infinite list $xs$, none of whose elements satisfy $P$, then the above
equations do not specify what the result list should be. For example, the equations
are satisfied if lfilter returns in this case the infinite list $[\mathit{arb}, \mathit{arb}, \ldots]$, where $\mathit{arb}$ is an
arbitrary element of the appropriate type. In other words, the equations do not have
a unique solution.

Happily we can remedy the situation as follows: We define by induction over nat
a predicate firstPelemAt of type $(\alpha \to \mathit{bool}) \to \alpha\ \mathit{llist} \to \mathit{nat} \to \mathit{bool}$. The expression
$(\mathit{firstPelemAt}\ P\ xs\ i)$ is true if $xs$ has at least $(i+1)$ elements and $i$ is the position of
the first element of $xs$ satisfying $P$. We can then define the predicate never of type
$(\alpha \to \mathit{bool}) \to \alpha\ \mathit{llist} \to \mathit{bool}$ as

\[ \mathit{never}\ P\ xs = \forall i .\ \lnot (\mathit{firstPelemAt}\ P\ xs\ i) \]

which is true when there are no elements in $xs$ satisfying $P$. If we modify the initial
recursive equations as follows:

\begin{align*}
\mathit{lfilter}\ P\ xs &= [\,], && \text{if } \mathit{never}\ P\ xs \\
\mathit{lfilter}\ P\ (x \,\#\, xs) &= \mathit{lfilter}\ P\ xs, && \text{if } \lnot (\mathit{never}\ P\ xs) \land \lnot (P\,x) \\
\mathit{lfilter}\ P\ (x \,\#\, xs) &= x \,\#\, (\mathit{lfilter}\ P\ xs), && \text{if } \lnot (\mathit{never}\ P\ xs) \land P\,x
\end{align*}

then the set of equations does indeed have a unique solution. This function is not
computable, since the predicate never can scan an infinite number of elements, but
it is nevertheless mathematically valid in HOL. The CERs described above are not
powerful enough to prove this, but we can define a well-founded function-space CER
combinator that is. Given a CER $C$ with $(<)$ of type $\alpha \to \alpha \to \mathit{bool}$ and $(\approx)$ with
type $\alpha \to \beta \to \beta \to \mathit{bool}$, and another well-founded transitive relation $(\prec)$ of type
$\tau \to \tau \to \mathit{bool}$, we define our new CER $C'$ with $(<')$ and $(\approx')$ as follows:

\begin{align*}
& (<') :: (\alpha * \tau) \to (\alpha * \tau) \to \mathit{bool} \\
& (\approx') :: (\alpha * \tau) \to (\tau \to \beta) \to (\tau \to \beta) \to \mathit{bool} \\
& (a', t') <' (a, t) \;=\; a' < a \lor (a' = a \land t' \prec t)
\end{align*}

\[ g \overset{(a,t)}{\approx'} h \;=\; \forall a'\, t' .\ ((a', t') <' (a, t) \lor (a', t') = (a, t)) \longrightarrow (g\,t') \overset{a'}{\approx} (h\,t') \]

It is a fair amount of work to show that $C'$ is in fact a CER, and space constraints
force us to elide the details.

Intuitively, however, $C'$ allows us to generalize well-founded recursion in the
following way: A well-founded recursive function is forced to have its argument decrease
in size on every recursive call. With $C'$, the function being defined is allowed a choice;
it can either decrease the size of its argument when making a recursive call, or not
decrease its argument size but then make sure the element it is returning is “better”
than the element returned from its recursive call.

In the case of functions returning lazy lists, a “better” lazy list is one that looks
just like the lazy list returned by the recursive call, but with at least one extra element
added to the front.

For us to use $C'$ on lfilter, we need to specify a suitable well-founded transitive
relation $(\prec)$. The relation we choose is one that holds when the first element satisfying
$P$ occurs sooner on the left-hand argument than on the right-hand argument:

\[ xs \prec ys \;=\; \mathit{firstPelem}\ P\ xs < \mathit{firstPelem}\ P\ ys \]
where
\[ \mathit{firstPelem}\ P\ xs = \begin{cases} 0 & \text{if } \mathit{never}\ P\ xs \\ 1 + (\varepsilon i .\ \mathit{firstPelemAt}\ P\ xs\ i) & \text{otherwise} \end{cases} \]

We arbitrarily decide that a list containing no $P$-elements is $\prec$-smaller than any list
with at least one $P$-element.


When analyzing the revised recursive equations for lfilter, if $xs$ has no $P$-elements
then we return immediately, otherwise $xs$ has to have at least one $P$-element. If that
element is not at the head of the list, then the tail of the list is $\prec$-smaller than $xs$. If
the first $P$-element is at the head of $xs$, then the tail of the list is not $\prec$-smaller than
$xs$, but the output list has one more element than the list returned by the recursive
call. Thus we informally conclude that lfilter is uniquely defined.

We have also proved this fact formally in Isabelle. After inductively proving various
simple lemmas about firstPelemAt, never, and firstPelem, we were able to prove that
lfilter is uniquely defined in five steps. We first translated the recursive equations
above into a contracting function $F$. We used $C'$ to prove that $F$ is contracting, first by
expanding the definition of $F$ and simplifying, and then by performing a case analysis
(no induction required!) on whether the nat component of the current resolution was
equal to zero. It took Isabelle two seconds to perform the proof.

Although we had to prove lemmas about firstPelemAt, never, and firstPelem,
the proofs are not hard and it turns out we can reuse these results when defining
other functions that perform unbounded search on lazy lists. For example, the lflatten
function takes a lazy list of lazy lists, and flattens all of the elements into a single
lazy list. The lflatten function can also be uniquely defined using never:

\begin{align*}
\mathit{lflatten}\ xss &= [\,], && \text{if } \mathit{never}\ (\lambda xs .\ xs \neq [\,])\ xss \\
\mathit{lflatten}\ (xs \,\#\, xss) &= \mathit{lappend}\ xs\ (\mathit{lflatten}\ xss), && \text{otherwise}
\end{align*}

The proof proceeds in Isabelle exactly as it does for lfilter except that we perform
one additional case analysis on whether $xs = [\,]$. The proof takes three seconds to
complete.
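The measure behind the ordering on argument lists can be tried out on finite Python lists, where the search for a P-element always terminates. This is a sketch under that restriction: on infinite lazy lists the `never` test is not computable, as noted above, and `first_p_elem` and `is_even` are hypothetical names for this illustration.

```python
def first_p_elem(p, xs):
    """Finite-list analogue of firstPelem.

    Returns 0 if no element of xs satisfies p, and otherwise 1 plus the
    position of the first element that does.
    """
    for i, x in enumerate(xs):
        if p(x):
            return 1 + i
    return 0

def is_even(v):
    return v % 2 == 0

# Head does not satisfy P: the tail has a strictly smaller measure.
whole_a = first_p_elem(is_even, [1, 3, 4])
tail_a = first_p_elem(is_even, [3, 4])

# Head satisfies P: the measure of the tail need not decrease.
whole_b = first_p_elem(is_even, [2, 4, 6])
tail_b = first_p_elem(is_even, [4, 6])

# No P-element at all: the smallest possible measure.
none_c = first_p_elem(is_even, [1, 3])
```

In the first example the recursive call is on a smaller argument; in the second it is not, and it is the extra output element that pays for the call.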

7  Proof  of  the  main  result 

Although the proof of the main theorem is too lengthy to describe here, we will provide
a rough outline. Given a CER with resolution space $\alpha$, target space $\beta$, well-founded
relation $(<)$, indexed equivalence relation $(\approx)$, and an arbitrary contracting function
$F$ of type $\beta \to \beta$, the technique will be to construct an approximation map $\mathit{apx}\ F$
that converges globally to the desired fixed point. We then prove that this fixed point
is unique by showing that any two fixed points of $F$ are equal.

The function apx of type $(\beta \to \beta) \to \alpha \to \beta$ that builds an approximation map
from a contracting function is defined by well-founded recursion on $(<)$ as follows:

\[ \mathit{apx}\ F\ i = F\ (\mathit{local\_limit}\ (\mathit{cut}\ (\mathit{apx}\ F)\ i)\ i) \]
where
\[ \mathit{cut}\ f\ i\ x = \mathbf{if}\ x < i\ \mathbf{then}\ f\,x\ \mathbf{else}\ \mathit{arbitrary}. \]

At each resolution $i$, the function apx uses local_limit to obtain the best possible
approximation of $\mathit{fix}\ F$, given the approximations it has already computed at all lower
resolutions. The result of calling local_limit may still not be close enough at resolution
$i$, so apx maps the local limit through $F$, which will bring the result close enough. The
helper function cut is used to ensure that the recursive call to $\mathit{apx}\ F$ is only made at
lower resolutions than $i$, ensuring well-foundedness. If local_limit attempts to invoke
$\mathit{cut}\ (\mathit{apx}\ F)\ i$ at any other resolution, then cut returns an arbitrary element instead.

Once we have proved by well-founded induction that apx is well defined, the next
step is to establish that $\mathit{apx}\ F$ is convergent up to each resolution $i$. To do this we prove
several lemmas, such as: if an approximation mapping $f$ converges up to a local limit
element $z$ at resolution $i$, and also converges up to a local limit element $z'$ at the same
resolution, then $z$ and $z'$ are equivalent at all resolutions $i' < i$. With this, and the fact
that $F$ is contracting, we can show that if $x \overset{i'}{\approx} y$, then $F\,x \overset{i}{\approx} F\,y$. We then eventually
show for all resolutions $i$ that if $\mathit{apx}\ F$ converges up to local limit element $\mathit{apx}\ F\ i$ at
resolution $i$, then $\mathit{apx}\ F\ i \overset{i}{\approx} F\,(\mathit{apx}\ F\ i)$. This lemma is the key to showing by
well-founded induction over $i$ that $\mathit{apx}\ F$ does in fact converge up to $\mathit{apx}\ F\ i$ at resolution
$i$, and is also used to show that $\mathit{global\_limit}\ (\mathit{apx}\ F) \overset{i}{\approx} F\,(\mathit{global\_limit}\ (\mathit{apx}\ F))$ at each
resolution $i$, so that the two are equal by (6). This result establishes that a fixed point exists
for $F$. We then show that any two fixed points $x$ and $y$ of $F$ are equivalent at all
resolutions by well-founded induction, and thus are equal, again by (6).


8  Conclusion 

Related  work  The  support  for  and  application  of  well-founded  induction  and  gen¬ 
eral  coinduction  has  seen  wide  acceptance  in  the  HOL  theorem  proving  community. 
The  well-founded  definition  package  TFL  used  in  HOL98  and  Isabelle  was  written  by 
Slind[15].  It  can  handle  nested  pattern  matching  in  rule  definitions,  nested  recursion 
in  function  bodies,  and  generates  custom  induction  rules  for  each  definition [16].  The 
PVS  theorem  prover[14]  also  uses  well-founded  induction  as  a  basic  definitional  prin¬ 
ciple.  A  general  theory  of  inductive  and  coinductive  sets  in  Isabelle  was  developed  by 
Paulson[12],  based  on  least  and  greatest  fixed  points  of  monotone  set-transforming 
functions,  as  well  as  a  package  for  defining  new  inductive  and  coinductive  sets  by  user- 
given  introduction  rules.  The  package  avoids  syntactic  restrictions  in  the  introduction 
rules  by  reasoning  about  each  rule’s  underlying  set-transformer  semantics. 

Paulson’s  Isabelle  theories  were  applied  by  Frost [3]  to  formalize  the  static  and 
dynamic  semantics  of  a  small  functional  language  and  prove  that  the  two  semantics 
were  consistent  with  each  other.  Recursive  functions  are  represented  by  infinitely 
nested  environments,  requiring  consistency  to  be  proved  by  coinduction.  The  language 
and  proof,  as  well  as  the  concept  of  coinduction  as  a  variant  of  fixpoint  induction, 
were  introduced  by  Milner  and  Tofte[8]. 

A  coinductive  theory  of  streams  (infinite-only  lists)  was  developed  by  Miner[9] 
in  the  PVS  theorem  prover.  Miner  used  this  theory  to  model  synchronous  hardware 
circuits  as  corecursively-defined  stream  transformers.  Using  coinduction,  he  was  able 
to  optimize  the  implementation  of  a  fault-tolerant  clock  synchronization  circuit  and 
a  floating-point  division  circuit.  In  several  cases  a  subcircuit  was  replaced  by  an 
optimized  subcircuit,  and  the  correctness  of  the  replacement  depended  on  non-trivial 
environmental  assumptions  in  the  surrounding  circuit.  Coinduction  was  used  to  verify 
the  environmental  assumptions  and  to  show  that  the  subcircuits  were  equivalent  under 
the  assumed  environment. 

A well-known alternative to coinductive types is the mathematical framework
of pointed complete partial orders and continuous functions, also known as domain
theory[5, 17]. This theory is supported by the HOLCF[10] object-logic in Isabelle, and
also  allows  one  to  define  infinite  data  structures  such  as  lazy  lists  and  trees.  A  wide 
variety  of  functions  over  these  structures  can  then  be  recursively  defined.  The  primary 
disadvantage  of  this  approach  is  that  one  must  add  “extra”  bottom-elements  to  the 
structures  being  defined.  These  extra  elements  are  used  to  indicate  that  a  function 
is non-terminating on its arguments. For example, the lazy filter function lfilter can
be defined recursively in HOLCF, but the expression lfilter P xs returns ⊥ instead of
[] when xs is an infinite list containing no elements satisfying P. Also, only so-called
admissible  predicates  can  be  reasoned  about  inductively  in  domain  theory,  and  it  can 
be  quite  challenging  to  prove  that  a  desired  predicate  is  admissible.  A  comparison  of 
the HOLCF approach to several other encodings of lazy lists is presented by Devillers
et al.[2].


Metric spaces[13] and topology[1] are another well-established definition mechanism.
The notions of Cauchy sequences, complete metric spaces, and contractions
inspired much of this work. We have not worked out the exact relationship between
converging equivalence relations and Cauchy metric spaces; although one can construct
a distance function for every CER based on the nat resolution space, it is not
clear that distance functions can always be constructed for more complex resolution
spaces. Also, the conditions under which a function F is contracting in a CER
seem  to  be  less  restrictive  than  the  corresponding  conditions  in  a  metric  space.  More 
importantly  from  a  verification  perspective,  well-founded  induction  seems  easier  to 
apply  in  current  theorem  provers  than  does  the  continuous  mathematics  required  for 
metric  spaces. 


Current  and  future  work  We  are  currently  using  CERs  to  specify  and  reason 
about  processor  microarchitectures  as  recursively  defined  stream  transformers.  This 
work  is  part  of  the  Hawk  project [7],  which  is  developing  a  domain-specific  functional 
language  for  specifying,  simulating,  and  reasoning  about  such  microarchitectures  at 
a  high  level  of  abstraction.  We  have  been  able  to  use  CERs  and  the  unique  fixed 
point lemmas in Sect. 3.2 to develop a domain-specific microarchitecture algebra[6] in
Isabelle,  which  we  use  to  verify  Hawk  specifications. 

We  have  mechanized  the  theory  of  CERs  in  Isabelle  and  have  been  able  to  define 
interesting lazy functions recursively, such as zip, filter, flatten, and several microarchitecture
specifications. However, we did so by reasoning about unique fixed points
directly.  One  possibility  would  be  to  write  a  package  along  the  lines  of  TFL  where 
users  need  only  supply  a  system  of  pattern  matching  recursive  equations  and  a  CER. 
The  package  would  then  automate  the  unique  existence  proofs. 

We  have  not  yet  seriously  explored  nested  recursion  with  CERs,  but  we  would  like 
to  in  the  future. 

Although  we  have  defined  CERs  over  streams  and  lazy  lists,  many  structures  in 
language  semantics  and  process  algebras  can  be  seen  as  coinductive  trees.  It  would 
be  interesting  to  define  some  of  these  structures  recursively  and  reason  about  them 
inductively,  as  we  did  for  lappend  in  Sect.  5.3. 
Abstract

The impact of Domain Specific Languages (DSLs) on software design is considerable. They
allow programs to be more concise than equivalent programs written in a high-level programming
language. They relieve programmers from making decisions about data-structure and algorithm
design, and thus allow solutions to be constructed quickly. Because DSLs are at a higher level
of abstraction, they are easier to maintain and reason about than equivalent programs written
in a high-level language, and, perhaps most importantly, they can be written by domain experts
rather than programmers.

The  problem  is  that  DSL  implementation  is  costly  and  prone  to  errors,  and  that  high  level 
approaches to DSL implementation often produce inefficient systems. By using two new
programming language mechanisms, program staging and monadic abstraction, we can lower the
cost  of  DSL  implementations  by  allowing  reuse  at  many  levels.  These  mechanisms  provide  the 
expressive  power  that  allows  the  construction  of  many  compiler  components  as  reusable  libraries, 
provide  a  direct  link  between  the  semantics  and  the  low-level  implementation,  and  provide  the 
structure  necessary  to  reason  about  the  implementation. 


1  Introduction 

We outline an improved method for the design and implementation of Domain-Specific Languages
(DSLs). The method builds upon our experience with staged programming using the staged
programming language MetaML [27, 26]. The method also incorporates ideas from other researchers
in the areas of modular language design [28, 24, 12], correct compiler generation [15, 19, 18, 16, 10],
and partial evaluation [8, 13]. While relying on recent advances in functional programming (such as
higher-order type constructors and local polymorphism), it is applicable to all kinds of languages,
not just applicative ones. The method unifies many of these ideas into a coherent process.

A  problem  with  the  DSL  approach  to  software  construction  is  its  cost.  Realizing  a  DSL  requires 
an  implementation.  Such  implementations  are  large  and  expensive  to  produce.  So,  unless  many 
solutions  are  required,  it  may  not  pay  to  build  a  compiler  or  other  implementation  mechanism. 
DSL implementation is also conceptually hard. Most software engineers are not comfortable taking
on the task of language design and implementation. Even if they are, language implementation
is a difficult, complex process that does not easily scale. An implementation for a simple language
often does not scale as the language evolves to meet newer demands. Lowering the cost of DSL
implementations,  and  making  good  ones  more  manageable,  will  make  the  DSL  approach  applicable 
to  a  broader  domain  of  problems. 

Our approach to solving these problems is to apply new methods of abstraction such as monads
[28, 31] and staging [27, 26] to the implementation of DSLs. This makes the effort required to
build a compiler for a DSL reusable and spreads the cost over several DSLs. To make language
implementation  manageable  for  the  masses,  there  must  exist  good  rules  of  thumb  for  language 
implementation.  One  way  to  accomplish  this  is  by  elaborating  a  step  by  step  method  that  splits 
the  labor  into  well-defined  steps,  each  with  a  relatively  small  amount  of  work.  In  our  method,  each 
step  deals  with  an  orthogonal  design  decision.  By  using  good  abstraction  principles,  our  method 
partitions  each  design  decision  into  a  separate  code  module.  In  addition,  our  method  makes  explicit 
the  propositions  that  must  be  proved  to  show  the  correctness  of  the  compiler  with  respect  to  its 
semantics. 

Our  method  comprises  the  following  steps.  First,  construct  the  denotational  semantics  as 
an  interpreter  in  a  functional  language.  Second,  capture  the  effects  of  the  language,  and  the 
environment  in  which  the  target  language  must  run,  in  a  monad.  Then  rewrite  the  interpreter  in 
a  monadic  style.  Third,  stage  the  interpreter  using  meta-programming  techniques.  This  staging  is 
similar  to  the  staging  of  interpreters  using  a  partial  evaluator,  but  is  explicit  rather  than  implicit, 
since  the  programmer  places  the  annotations  directly,  rather  than  using  an  automatic  binding  time 
analysis  to  discover  where  they  should  be  placed.  This  leaves  programmers  in  complete  control, 
and they can limit what appears in the residual program. Fourth, the resulting program is both a
data-structure and a program, so it can be both directly executed and analyzed. This analysis can
include source-to-source transformations or translation into another form (e.g. intermediate
code or assembly language). Because the programmer has complete control over the earlier steps,
the structure of the residual program is highly constrained, and this final translation can be a trivial
task.

Staging of interpreters using partial evaluation has been done before [1, 5]. The contribution of
this paper is to show that this can all be done in a single program. A system incorporating staging
as a first-class feature of a language is a powerful tool. When such a tool is used to write a compiler,
the source language can be given a semantics, and that semantics can be staged, translated, and
optimized, all in a single paradigm. This requires neither additional processes nor tools, and is under
the complete control of the programmer, all the while maintaining a direct link between the semantics
of the interpreter and those of the compiler.

2  Staging  in  MetaML 

MetaML  is  almost  a  conservative  extension  of  Standard  ML.  Its  extensions  include  four  staging 
annotations.  To  delay  an  expression  until  the  next  stage  one  places  it  between  meta-brackets. 
Thus  the  expression  <23>  (pronounced  “bracket  23”)  has  type  <int>  (pronounced  “code  of  int”). 
The  annotation,  ~e  splices  the  deferred  expression  obtained  by  evaluating  e  into  the  body  of  a 
surrounding  Bracketed  expression;  and  run  e  evaluates  e  to  obtain  a  deferred  expression,  and  then 
evaluates  this  deferred  expression.  It  is  important  to  note  that  ~e  is  only  legal  within  lexically 
enclosing  Brackets.  We  illustrate  the  important  features  of  the  staging  annotations  in  the  short 
MetaML  sessions  below. 

-| val z = 3+4;
val  z  =  7  :  int 

Users  access  MetaML  through  a  read-type-eval-print  top-level.  The  declaration  for  z  is  read, 
type-checked  to  see  that  it  has  a  consistent  type  (int  here),  evaluated  (to  7),  and  then  both  its 
value  and  type  are  printed. 

-| val quad = ( 3+4, <3+4>, lift (3+4), <z> );
val quad = (7, <3 %+ 4>, <7>, <%z>) :
           ( int * <int> * <int> * <int> )

The  declaration  for  quad  contrasts  normal  evaluation  with  the  three  ways  objects  of  type  code 
can  be  constructed.  Placing  brackets  around  an  expression  (<3+4>)  defers  the  computation  of  3+4 
to  the  next  stage,  returning  a  piece  of  code.  Lifting  an  expression  (lift  (3+4))  evaluates  that 
expression  (to  7  here)  and  then  lifts  the  value  to  a  piece  of  code  that  when  evaluated  returns  the 
same value. Brackets around a free variable (<z>) create a new constant piece of code with the
value of the variable. Such constants print with a % sign to indicate they are constants. We call
this lexical capture of free variables. Because in MetaML operators (such as + and *) are also
identifiers, free occurrences of operators in constructed code often appear with % in front of them.

-| fun inc x = <1 + ~x>;
val inc = Fn : ['a].<int> -> <int>

The declaration of the function inc illustrates that larger pieces of code can be constructed from
smaller ones by using the escape annotation. Bracketed expressions can be viewed as frozen, i.e.
evaluation does not apply under brackets. However, it is often convenient to allow some reduction
steps inside a large frozen expression while it is being constructed, by “splicing” in a previously
constructed piece of code. MetaML allows one to escape from a frozen expression by prefixing a
sub-expression within it with the tilde (~) character. Escape must only appear inside brackets.

-| val six = inc <5>;
val six = <1 %+ 5> : <int>

In the declaration for six, the function inc is applied to the piece of code <5>, constructing
the new piece of code <1 %+ 5>.

-|  run  six; 
val  it  =  6  :  int 

Running a piece of code strips away the enclosing brackets and evaluates the expression inside.
To give a brief feel for how MetaML is used to construct larger pieces of code at run-time, consider:

-| fun mult x n = if n=0 then <1> else < ~x * ~(mult x (n-1)) >;
val mult = fn : <int> -> int -> <int>

-| val cube = <fn y => ~(mult <y> 3)>;
val cube = <fn a => a * (a * (a * 1))> : <int -> int>

-| fun exponent n = <fn y => ~(mult <y> n)>;
val exponent = fn : int -> <int -> int>

The  function  mult,  given  an  integer  piece  of  code  x  and  an  integer  n,  produces  a  piece  of  code 
that  is  an  n-way  product  of  x.  This  can  be  used  to  construct  the  code  of  a  function  that  performs  the 
cube  operation,  or  generalized  to  a  generator  for  producing  an  exponentiation  function  from  a  given 
exponent  n.  Note  how  the  looping  overhead  has  been  removed  from  the  generated  code.  This  is  the 
purpose  of  program  staging  and  it  can  be  highly  effective  as  discussed  elsewhere  [4,  6,  11,  23,  27]. 
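The generator above can be imitated outside MetaML. The following is a minimal Python sketch, not MetaML itself: it assumes that code is represented as plain source strings, with string interpolation standing in for escape and Python's built-in eval standing in for run. Strings carry none of MetaML's typing or scoping guarantees, so the sketch only illustrates how the loop over n is spent at generation time.

```python
def mult(x: str, n: int) -> str:
    """Return source text for an n-way product of the expression text x."""
    if n == 0:
        return "1"
    # Interpolating the recursive result plays the role of ~(mult x (n-1)).
    return f"({x} * {mult(x, n - 1)})"


def exponent(n: int) -> str:
    """Return source text for a function raising its argument to the n-th power."""
    return f"lambda y: {mult('y', n)}"


cube_src = exponent(3)   # the recursion over n runs here, at generation time
cube = eval(cube_src)    # stands in for MetaML's run
print(cube_src)          # lambda y: (y * (y * (y * 1)))
print(cube(5))           # 125
```

The generated text contains no test on n and no recursive call, which is the same effect the MetaML session shows for cube.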
3 Monads in MetaML


We assume the reader has a working knowledge of monads[29, 31]. We use the unit and bind
formulation of monads[30]. In MetaML a monad is a data structure encapsulating a type constructor
M and the unit and bind functions.

datatype ('M : * -> * ) Monad = Mon of
    (['a]. 'a -> 'a 'M) *                           (* unit function *)
    (['a,'b]. 'a 'M -> ('a -> 'b 'M) -> 'b 'M);     (* bind function *)

This definition uses SML’s postfix notation for type application, and two non-standard
extensions to ML. First, it declares that the argument ('M : * -> *) of the type constructor Monad
is itself a unary type constructor [7]. We say that 'M has kind * -> *. Second, it declares that
the arguments to the constructor Mon must be polymorphic functions [17]. The type variables in
brackets, e.g. ['a, 'b], are universally quantified. Because of the explicit type annotations in the
datatype  definitions  the  effect  of  these  extensions  on  the  Hindley-Milner  type  inference  system  is 
well  known  and  poses  no  problems  for  the  MetaML  type  inference  engine. 

In  MetaML,  Monad  is  a  first-class,  although  pre-defined  or  built-in  type.  In  particular,  there 
are  two  syntactic  forms  which  are  aware  of  the  Monad  datatype:  Do  and  Return.  Do  and  Return 
are  MetaML’s  syntactic  interface  to  the  unit  and  bind  of  a  monad.  We  have  modeled  them  after 
the  do-notation  of  Haskell[9,  20].  An  important  difference  is  that  MetaML’s  Do  and  Return  are 
both parameterized by an expression of type 'M Monad. Do and Return are syntactic sugar for the
following: 

(* Syntactic Sugar                        Derived Form *)
Do (Mon(unit,bind)) { x <- e; f }      =  bind e (fn x => f)
Return (Mon(unit,bind)) e              =  unit e

In addition the syntactic sugar of the Do allows a sequence of x_i <- e_i forms, and defines this
as a nested sequence of Do’s. For example:

Do m { x1 <- e1; x2 <- e2 ; x3 <- e3 ; e4 } =
Do m { x1 <- e1; Do m { x2 <- e2 ; Do m { x3 <- e3 ; e4 }}}

Users  may  freely  construct  their  own  monads,  though  they  should  be  very  careful  that  their 
instantiation  meets  the  monad  axioms.  The  monad  axioms,  expressed  in  MetaML’s  Do  and  Return 
notation  are: 

Do { x <- Return e ; z }  =  z[e/x]
Do { x <- m ; Return x }  =  m
Do { x <- Do { y <- a ; b } ; c }  =  Do { y' <- a ; Do { x <- b[y'/y] ; c } }
                                   =  Do { y' <- a ; x <- b[y'/y] ; c }
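To make the axioms concrete, here is a small Python sketch that checks them on sample values. The monad used (a value paired with accumulated output text) and the names unit, bind, m, f, and g are assumptions chosen for illustration; they are not part of MetaML, and a check on sample values is evidence rather than a proof.

```python
def unit(x):
    """Return: a value that produces no output."""
    return (x, "")


def bind(m, f):
    """Run m, feed its value to f, and concatenate the two outputs."""
    a, out1 = m
    b, out2 = f(a)
    return (b, out1 + out2)


# Hypothetical sample computations.
m = (3, "start ")
f = lambda a: (a + 1, f"got {a} ")
g = lambda b: (b * 2, f"doubling {b} ")

# Do { x <- Return e ; z } = z[e/x]
left_identity = bind(unit(3), f) == f(3)
# Do { x <- m ; Return x } = m
right_identity = bind(m, unit) == m
# Nested Do on the left equals the flattened Do on the right.
associativity = bind(bind(m, f), g) == bind(m, lambda a: bind(f(a), g))

print(left_identity, right_identity, associativity)
```

An instantiation whose bind dropped or reordered the output would fail the third check, which is the kind of mistake the axioms are meant to rule out.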

4  Illustrating  our  compiler  development  method 

In this section, we illustrate our method by building the front end of a compiler for a small
imperative while-language. We proceed in three steps. First, we introduce the language and its
denotational semantics by giving a monadic interpreter as a one stage MetaML program. Second,
we stage this interpreter by using a two stage MetaML program in order to produce a compiler.
Third, we illustrate the usefulness of the staging approach by showing how MetaML’s intensional
analysis tools can be used to optimize or further translate the output of a staged program.
4.1 The while-language

In this section, we introduce a simple while-language composed from the syntactic elements:
expressions (Exp) and commands (Com). In this simple language expressions are composed of integer
constants,  variables,  and  operators.  A  simple  algebraic  datatype  to  describe  the  abstract  syntax  of 
expressions  is  given  in  MetaML  below: 


datatype Exp =
    Constant of int            (* 5     *)
  | Variable of string         (* x     *)
  | Minus of (Exp * Exp)       (* x - 5 *)
  | Greater of (Exp * Exp)     (* x > 1 *)
  | Times of (Exp * Exp) ;     (* x * 4 *)

Commands  include  assignment,  sequencing  of  commands,  a  conditional  (if  command),  while 
loops,  a  print  command,  and  a  declaration  which  introduces  new  statically  scoped  variables.  A 
declaration  introduces  a  variable,  provides  an  expression  that  defines  its  initial  value,  and  limits  its 
scope  to  the  enclosing  command.  A  simple  algebraic  datatype  to  describe  the  abstract  syntax  of 
commands  is: 


datatype Com =
    Assign of (string * Exp)           (* x := 1                        *)
  | Seq of (Com * Com)                 (* { x := 1; y := 2 }            *)
  | Cond of (Exp * Com * Com)          (* if x then x := 1 else y := 1  *)
  | While of (Exp * Com)               (* while x>0 do x := x - 1       *)
  | Declare of (string * Exp * Com)    (* declare x = 1 in x := x - 1   *)
  | Print of Exp;                      (* print x                       *)

A simple while-program in concrete syntax, such as

declare x = 150 in
  declare y = 200 in { while x > 0 do { x := x - 1; y := y - 1 }; print y }

is encoded abstractly in these datatypes as follows:

val S1 =
  Declare("x", Constant 150,
    Declare("y", Constant 200,
      Seq(While(Greater(Variable "x", Constant 0),
                Seq(Assign("x", Minus(Variable "x", Constant 1)),
                    Assign("y", Minus(Variable "y", Constant 1)))),
          Print(Variable "y"))));

4.2  The  structure  of  the  solution 

Staging is an important technique for developing efficient programs, but it requires some
forethought. To get the best results one should design algorithms with their staged solutions in mind.

The  meaning  of  a  while-program  depends  only  on  the  meaning  of  its  component  expressions  and 
commands.  In  the  case  of  expressions,  this  meaning  is  a  function  from  environments  to  integers. 
The  environment  is  a  mapping  between  names  (which  are  introduced  by  Declare)  and  their  values. 

There  are  several  ways  that  this  mapping  might  be  implemented.  Since  we  intend  to  stage  the 
interpreter,  we  break  this  mapping  into  two  components.  The  first  component,  a  list  of  names,  will 
be  completely  known  at  compile-time.  The  second  component,  a  list  of  integer  values  that  behaves 
like  a  stack,  will  only  be  known  at  the  run-time  of  the  compiled  program. 
The functions that access this environment distribute their computation into two stages: first,
determining at what location a name appears in the name list, and second, accessing the correct
integer from the stack at this location. In a more complicated compiler the mapping from names
to  locations  would  depend  on  more  than  just  the  declaration  nesting  depth,  but  the  principle 
remains  the  same.  Since  every  variable’s  location  can  be  completely  computed  at  compile-time,  it 
is  important  that  we  do  so,  and  that  these  locations  appear  as  constants  in  the  next  stage. 

Splitting  the  environment  into  two  components  is  a  standard  technique  (often  called  a  binding 
time  improvement)  used  by  the  partial  evaluation  community [8].  We  capture  this  precisely  by  the 
following  purely  functional  implementation. 

type  location  =  int; 
type  index  =  string  list; 
type  stack  =  int  list; 

(* position : string -> index -> location *)
fun position name index =
  let fun pos n (nm::nms) = if name = nm then n else pos (n+1) nms
  in pos 1 index end;

(* fetch : location -> stack -> int *)
fun fetch n (v::vs) = if n = 1 then v else fetch (n-1) vs;

(* put : location -> int -> stack -> stack *)
fun put n x (v::vs) = if n = 1 then x::vs else v::(put (n-1) x vs);
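The split between the static index and the dynamic stack can be tried out in a few lines of Python. This is a sketch under assumptions: Python lists stand in for the ML lists, and the explicit error checks are additions, since the MetaML definitions simply have no clause for a missing name or an out-of-range location.

```python
from typing import List


def position(name: str, index: List[str]) -> int:
    """1-based location of the innermost declaration of name (compile-time part)."""
    for n, nm in enumerate(index, start=1):
        if nm == name:
            return n
    raise KeyError(name)  # added check; the original is a partial function


def fetch(n: int, stack: List[int]) -> int:
    """Value stored at location n (run-time part)."""
    if not 1 <= n <= len(stack):
        raise IndexError(n)
    return stack[n - 1]


def put(n: int, x: int, stack: List[int]) -> List[int]:
    """New stack with location n replaced by x; the old stack is left unchanged."""
    if not 1 <= n <= len(stack):
        raise IndexError(n)
    return stack[: n - 1] + [x] + stack[n:]


# Inside "declare x = 150 in declare y = 200 in ...", y is innermost.
index = ["y", "x"]          # known at compile time
stack = [200, 150]          # known only at run time
loc = position("x", index)  # computed without looking at the stack
print(loc, fetch(loc, stack), put(loc, 149, stack))
```

Only position consults the names, and only fetch and put consult the values, which is what lets a location become a constant in the next stage.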

The  meaning  of  Com  is  a  stack  transformer  and  an  output  accumulator.  It  transforms  one  stack 
(with  values  of  variables  in  scope)  into  another  stack  (with  presumably  different  values  for  the  same 
variables)  while  accumulating  the  output  printed  by  the  program. 

To  produce  a  monadic  interpreter  we  could  define  a  monad  which  encapsulates  the  index, 
the  stack,  and  the  output  accumulation.  Because  we  intend  to  stage  the  interpreter  we  do  not 
encapsulate  the  index  in  the  monad.  We  want  the  monad  to  encapsulate  only  the  dynamic  part  of 
the  environment  (the  stack  of  values  where  each  value  is  accessed  by  its  position  in  the  stack,  and 
the  output  accumulation). 

The monad we use is a combination of the monad of state and the monad of output.

datatype 'a M = StOut of (stack -> ('a * stack * string));

fun unStOut (StOut f) = f;

fun unit x = StOut(fn n => (x, n, ""));

fun bind e f = StOut(fn n => let val (a,n1,s1) = (unStOut e) n
                                 val (b,n2,s2) = unStOut (f a) n1
                             in (b, n2, s1 ^ s2) end);

val mswo : M Monad = Mon(unit,bind);   (* Monad of state with output *)

The  non-standard  morphisms  must  describe  how  the  stack  is  extended  (or  shrunk)  when  new 
variables  come  into  (or  out  of)  scope;  how  the  value  of  a  particular  variable  is  read  or  updated;  and 
how  the  printed  text  is  accumulated.  Each  can  be  thought  of  as  an  action  on  the  stack  of  mutable 
variables,  or  an  action  on  the  print  stream. 

(* read : location -> int M *)
fun read i = StOut(fn ns => (fetch i ns, ns, ""));

(* write : location -> int -> unit M *)
fun write i v = StOut(fn ns => ((), put i v ns, ""));

(* push : int -> unit M *)
fun push x = StOut(fn ns => ((), x :: ns, ""));

(* pop : unit M *)
val pop = StOut(fn (n::ns) => ((), ns, ""));

(* output : int -> unit M *)
fun output n = StOut(fn ns => ((), ns, (toString n) ^ " "));
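The behaviour of these actions can be sketched in Python. The sketch assumes that a computation is a plain function from a stack to a (value, stack, output) triple, so the StOut wrapper disappears; the helper then, the empty-stack check in pop, and the sample sequence demo are additions for illustration only.

```python
def unit(x):
    return lambda stack: (x, stack, "")


def bind(m, f):
    def run(stack):
        a, s1, out1 = m(stack)
        b, s2, out2 = f(a)(s1)
        return b, s2, out1 + out2   # thread the stack, concatenate the output
    return run


def then(m, k):
    """Hypothetical helper: sequence two actions, ignoring the first value."""
    return bind(m, lambda _: k)


def read(i):
    return lambda stack: (stack[i - 1], stack, "")


def write(i, v):
    return lambda stack: (None, stack[: i - 1] + [v] + stack[i:], "")


def push(x):
    return lambda stack: (None, [x] + stack, "")


def pop(stack):
    if not stack:
        raise IndexError("pop from empty stack")  # added check
    return (None, stack[1:], "")


def output(n):
    return lambda stack: (None, stack, f"{n} ")


# push 5; overwrite it with 6; read it back; print it; pop it again.
demo = then(push(5),
       then(write(1, 6),
       bind(read(1), lambda v:
       then(output(v), pop))))
print(demo([200]))   # (None, [200], '6 ')
```

Each action touches either the stack or the output, never both, and bind is the only place where the two are combined.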

4.3  Step  1:  monadic  interpreter 

Because  expressions  do  not  alter  the  stack,  or  produce  any  output,  we  could  give  an  evaluation 
function  for  expressions  which  is  not  monadic,  or  which  uses  a  simpler  monad  than  the  monad 
defined above. We choose to use the monad of state with output throughout our implementation
for two reasons: first, for simplicity of presentation, and second, because if the while-language
semantics should evolve, using the same monad everywhere makes it easy to reuse the monadic
evaluation function with few changes.

The only non-standard morphism evident in the eval1 function is read, which describes how
the value of a variable is obtained. The monadic interpreter for expressions takes an index mapping
names to locations and returns a computation producing an integer.

(* eval1 : Exp -> index -> int M *)
fun eval1 exp index =
case exp of
  Constant n => Return mswo n
| Variable x => let val loc = position x index
                in read loc end
| Minus(x,y) =>
    Do mswo { a <- eval1 x index ;
              b <- eval1 y index ;
              Return mswo (a - b) }
| Greater(x,y) =>
    Do mswo { a <- eval1 x index ;
              b <- eval1 y index ;
              Return mswo (if a > b then 1 else 0) }
| Times(x,y) =>
    Do mswo { a <- eval1 x index ;
              b <- eval1 y index ;
              Return mswo (a * b) };

The  interpreter  for  Com  uses  the  non-standard  morphisms  write,  push,  and  pop  to  transform 
the  stack  and  the  morphism  output  to  add  to  the  output  stream. 

(* interpret1 : Com -> index -> unit M *)
fun interpret1 stmt index =
case stmt of
  Assign(name,e) =>
    let val loc = position name index
    in Do mswo { v <- eval1 e index ; write loc v } end
| Seq(s1,s2) =>
    Do mswo { x <- interpret1 s1 index ;
              y <- interpret1 s2 index ;
              Return mswo () }
| Cond(e,s1,s2) =>
    Do mswo { x <- eval1 e index ;
              if x=1
                 then interpret1 s1 index
                 else interpret1 s2 index }
| While(e,body) =>
    let fun loop () =
          Do mswo { v <- eval1 e index ;
                    if v=0 then Return mswo ()
                           else Do mswo { interpret1 body index ;
                                          loop () } }
    in loop () end
| Declare(nm,e,stmt) =>
    Do mswo { v <- eval1 e index ;
              push v ;
              interpret1 stmt (nm::index) ;
              pop }
| Print e =>
    Do mswo { v <- eval1 e index ;
              output v };

Although interpret1 is fairly standard, we feel that two things are worth pointing out. First,
the clause for the Declare constructor, which calls push and pop, implicitly changes the size of the
stack and explicitly changes the size of the index (nm::index), keeping the two in sync. It evaluates
the initial value for a new variable, extends the index with the variable’s name and the stack with
its value, and then executes the body of the Declare. Afterwards it removes the binding from the
stack (using pop), all the while implicitly threading the accumulated output. The mapping is in
scope only for the body of the declaration.
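The interplay of index and stack in the Declare clause can be observed by running a Python transliteration of the interpreter on the example program. This is a sketch under assumptions: the abstract syntax is encoded as tagged tuples, computations are plain functions from a stack to a (value, stack, output) triple, and the While clause uses a Python loop rather than a recursive loop function because Python does not eliminate tail calls.

```python
def unit(x):
    return lambda s: (x, s, "")

def bind(m, f):
    def run(s):
        a, s1, out1 = m(s)
        b, s2, out2 = f(a)(s1)
        return b, s2, out1 + out2
    return run

def then(m, k):
    return bind(m, lambda _: k)

def read(i):
    return lambda s: (s[i - 1], s, "")

def write(i, v):
    return lambda s: (None, s[: i - 1] + [v] + s[i:], "")

def push(x):
    return lambda s: (None, [x] + s, "")

def pop(s):
    return (None, s[1:], "")

def output(n):
    return lambda s: (None, s, f"{n} ")

def position(name, index):
    return index.index(name) + 1          # 1-based, innermost declaration first

OPS = {"Minus": lambda a, b: a - b,
       "Greater": lambda a, b: 1 if a > b else 0,
       "Times": lambda a, b: a * b}

def eval1(exp, index):
    tag = exp[0]
    if tag == "Constant":
        return unit(exp[1])
    if tag == "Variable":
        return read(position(exp[1], index))
    op = OPS[tag]
    return bind(eval1(exp[1], index), lambda a:
           bind(eval1(exp[2], index), lambda b: unit(op(a, b))))

def interpret1(stmt, index):
    tag = stmt[0]
    if tag == "Assign":
        loc = position(stmt[1], index)
        return bind(eval1(stmt[2], index), lambda v: write(loc, v))
    if tag == "Seq":
        return then(interpret1(stmt[1], index), interpret1(stmt[2], index))
    if tag == "Cond":
        return bind(eval1(stmt[1], index), lambda x:
                    interpret1(stmt[2], index) if x == 1
                    else interpret1(stmt[3], index))
    if tag == "While":
        test, body = eval1(stmt[1], index), interpret1(stmt[2], index)
        def loop(s):
            out = ""
            while True:
                v, s, o = test(s)
                out += o
                if v == 0:
                    return None, s, out
                _, s, o = body(s)
                out += o
        return loop
    if tag == "Declare":
        _, name, init, body = stmt
        # push extends the stack; [name] + index extends the index in step with it.
        return bind(eval1(init, index), lambda v:
               then(push(v),
               then(interpret1(body, [name] + index), pop)))
    if tag == "Print":
        return bind(eval1(stmt[1], index), output)
    raise ValueError(tag)

X, Y = ("Variable", "x"), ("Variable", "y")
ONE = ("Constant", 1)
S1 = ("Declare", "x", ("Constant", 150),
      ("Declare", "y", ("Constant", 200),
       ("Seq",
        ("While", ("Greater", X, ("Constant", 0)),
         ("Seq", ("Assign", "x", ("Minus", X, ONE)),
                 ("Assign", "y", ("Minus", Y, ONE)))),
        ("Print", Y))))

value, final_stack, printed = interpret1(S1, [])([])
print(repr(printed), final_stack)   # '50 ' []
```

The final stack is empty again because every push performed by a Declare is matched by a pop once its body has run.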

Second, the clause for the While constructor introduces a local tail-recursive function loop.
This function emulates the body of the while. It is tempting to control the recursion introduced
by the While through the recursion of the interpret1 function itself, with a clause something
like:
| While(e,body) =>
    Do mswo { v <- eval1 e index ;
              if v=0 then Return mswo ()
                     else Do mswo { interpret1 body index ;
                                    interpret1 (While(e,body)) index }
            }


Here,  if  the  test  of  the  loop  is  true,  we  run  the  body  once  (to  transform  the  stack  and  accumulate 
output)  and  then  repeat  the  whole  loop  again.  This  strategy,  while  correct,  will  have  disastrous 
results  when  we  stage  the  interpreter,  as  it  will  cause  the  first  stage  to  loop  infinitely. 

There  are  two  recursions  going  on  here.  First  the  unfolding  of  the  finite  data  structure  which 
encodes  the  program  being  compiled,  and  second,  the  recursion  in  the  program  being  compiled.  In 
an  unstaged  interpreter  a  single  loop  suffices.  In  a  staged  interpreter,  both  loops  are  necessary.  In 
the  first  stage  we  only  unfold  the  program  being  compiled  and  this  must  always  terminate.  Thus 
we  must  plan  ahead  as  we  follow  our  three  step  process.  Nevertheless,  despite  the  concessions 
we  have  made  to  staging,  this  interpreter  is  still  clear,  concise  and  describes  the  semantics  of  the 
while-language  in  a  straight-forward  manner. 
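The two recursions can be told apart in a small Python sketch in which closures stand in for generated code. Everything here is an assumption made for illustration: the tagged-tuple syntax, the mutable list used as the stack, the 0-based locations, and the visits counter, which only records how often the first stage looks at the program text.

```python
visits = 0   # number of first-stage (compile-time) visits to syntax nodes


def compile_exp(exp, index):
    global visits
    visits += 1
    tag = exp[0]
    if tag == "Constant":
        n = exp[1]
        return lambda stack: n
    if tag == "Variable":
        loc = index.index(exp[1])          # resolved once, in the first stage
        return lambda stack: stack[loc]
    left, right = compile_exp(exp[1], index), compile_exp(exp[2], index)
    if tag == "Minus":
        return lambda stack: left(stack) - right(stack)
    if tag == "Greater":
        return lambda stack: 1 if left(stack) > right(stack) else 0
    raise ValueError(tag)


def compile_com(stmt, index):
    global visits
    visits += 1
    tag = stmt[0]
    if tag == "Assign":
        loc, rhs = index.index(stmt[1]), compile_exp(stmt[2], index)
        def assign(stack):
            stack[loc] = rhs(stack)
        return assign
    if tag == "While":
        # First recursion: unfold the test and the body exactly once.
        test, body = compile_exp(stmt[1], index), compile_com(stmt[2], index)
        # Calling compile_com(stmt, index) here instead would never return.
        def loop(stack):
            # Second recursion: the loop of the program being compiled.
            while test(stack) != 0:
                body(stack)
        return loop
    raise ValueError(tag)


prog = ("While", ("Greater", ("Variable", "x"), ("Constant", 0)),
        ("Assign", "x", ("Minus", ("Variable", "x"), ("Constant", 1))))
run = compile_com(prog, ["x"])
compile_time_visits = visits
stack = [5]
run(stack)
print(compile_time_visits, visits, stack)   # 8 8 [0]
```

The counter does not move while the compiled loop runs five iterations, which is the property the first stage must have: it unfolds the finite program text and then stops.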
4.4 Step 2: staged interpreter

To specialize the monadic interpreter to a given program we add two levels of staging annotations.
The result of the first stage is the intermediate code which, if executed, returns the value of the
program. The use of the bracket annotation enables us to describe precisely the code that must be
generated to run in the next stage. Escape annotations allow us to escape the recursive calls of the
interpreter that are made when compiling a while-program.

(* eval2 : Exp -> index -> <int M> *)
fun eval2 exp index =
case exp of
  Constant n => <Return mswo ~(lift n)>
| Variable x =>
    let val loc = position x index
    in <read ~(lift loc)> end
| Minus(x,y) =>
    <Do mswo { a <- ~(eval2 x index) ;
               b <- ~(eval2 y index) ;
               Return mswo (a - b) }>
| Greater(x,y) =>
    <Do mswo { a <- ~(eval2 x index) ;
               b <- ~(eval2 y index) ;
               Return mswo (if a > b then 1 else 0) }>
| Times(x,y) =>
    <Do mswo { a <- ~(eval2 x index) ;
               b <- ~(eval2 y index) ;
               Return mswo (a * b) }>;

The  lift  operator  inserts  the  value  of  loc  as  the  argument  to  the  read  action.  The  value  of  loc 
is  known  in  the  first-stage  (compile-time),  so  it  is  transformed  into  a  constant  in  the  second-stage 
(run-time)  by  lift. 

To  understand  why  the  escape  operators  are  necessary,  let  us  consider  a  simple  example:  eval2 
(Minus (Constant  3, Constant  1))  [].  We  will  unfold  this  example  by  hand  below: 

eval2 (Minus(Constant 3, Constant 1)) [] =

< Do mswo
  { a <- ~(eval2 (Constant 3) []);
    b <- ~(eval2 (Constant 1) []);
    Return mswo (a - b) } > =

< Do mswo
  { a <- ~<Return mswo 3>;
    b <- ~<Return mswo 1>;
    Return mswo (a - b) } > =

< Do mswo
  { a <- Return mswo 3;
    b <- Return mswo 1;
    Return mswo (a - b) } > =

< Do %mswo
  { a <- Return %mswo 3;
    b <- Return %mswo 1;
    Return %mswo (a %- b) } >

Each recursive call produces a bracketed piece of code which is spliced into the larger piece being
constructed. Recall that escapes may only appear at level-1 and higher. Splicing is axiomatized
by the reduction rule ~<x> → x, which applies only at level-1. The final step, where mswo and -
become %mswo and %-, occurs because both are free variables and are lexically captured.
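The interplay of brackets, escapes, and lift can be imitated in an ordinary language by generating source text. The following Python sketch is only an analogy under stated assumptions: code fragments are strings, f-string holes stand in for escapes, the monad is dropped so that expressions are pure, and the names read, stack, and the 1-based position convention are illustrative choices rather than part of MetaML.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass
class Constant:
    n: int


@dataclass
class Variable:
    name: str


@dataclass
class Minus:
    left: "Exp"
    right: "Exp"


Exp = Union[Constant, Variable, Minus]


def lift(value: int) -> str:
    """Turn a first-stage value into second-stage source text."""
    return repr(value)


def position(name: str, index: List[str]) -> int:
    """1-based position of `name` in the compile-time index (assumed convention)."""
    return index.index(name) + 1


def eval2(exp: Exp, index: List[str]) -> str:
    """First stage: return Python source text for `exp`."""
    if isinstance(exp, Constant):
        return lift(exp.n)
    if isinstance(exp, Variable):
        # The location is computed now and embedded as a literal.
        return f"read(stack, {lift(position(exp.name, index))})"
    if isinstance(exp, Minus):
        # Each hole plays the role of an escape: the recursive call
        # yields a fragment that is spliced into the larger fragment.
        return f"({eval2(exp.left, index)} - {eval2(exp.right, index)})"
    raise TypeError(f"unknown expression: {exp!r}")


def read(stack: List[int], loc: int) -> int:
    """Second-stage primitive: fetch the value at 1-based location `loc`."""
    return stack[loc - 1]


# First stage: generate code for x - 1 with x at location 1.
code = eval2(Minus(Variable("x"), Constant(1)), ["x"])
# Second stage: run the generated code against a run-time stack.
result = eval(code, {"read": read, "stack": [10]})
```

Here code is the string "(read(stack, 1) - 1)": the call to position has disappeared from the generated text, just as the compile-time operations disappear from the MetaML object program.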

Interpreter  for  Commands. 

Staging  the  interpreter  for  commands  proceeds  in  a  similar  manner: 

(* interpret2 : Com -> index -> <unit M> *)
fun interpret2 stmt index =
case stmt of
  Assign(name,e) =>
    let val loc = position name index
    in <Do mswo { n <- ~(eval2 e index);
                  write ~(lift loc) n }>
    end
| Seq(s1,s2) =>
    <Do mswo { x <- ~(interpret2 s1 index);
               y <- ~(interpret2 s2 index);
               Return mswo () }>
| Cond(e,s1,s2) =>
    <Do mswo { x <- ~(eval2 e index);
               if x=1
                 then ~(interpret2 s1 index)
                 else ~(interpret2 s2 index) }>
| While(e,body) =>
    <let fun loop () =
           Do mswo { v <- ~(eval2 e index);
                     if v=0
                       then Return mswo ()
                       else Do mswo { q <- ~(interpret2 body index); loop () }
                   }
     in loop () end>
| Declare(nm,e,stmt) =>
    <Do mswo { x <- ~(eval2 e index);
               push x;
               ~(interpret2 stmt (nm::index));
               pop }>
| Print e =>
    <Do mswo { x <- ~(eval2 e index);
               output x }>;
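To see the two stages of the command interpreter at work outside MetaML, the sketch below uses Python closures as a stand-in for generated code. This is an assumption-laden analogy: closures cannot be inspected the way MetaML code can, the tuple encoding of syntax and the Machine class are hypothetical, and Cond and Times are omitted for brevity. What it does preserve is the division of labor: interpret2 recurs only over the syntax and always terminates, while the looping of a while command lives in the returned loop closure.

```python
from typing import Callable, List, Tuple


class Machine:
    """Run-time state: a value stack (newest first) and an output stream."""

    def __init__(self) -> None:
        self.stack: List[int] = []
        self.out: List[int] = []


Action = Callable[[Machine], int]     # staged expression
Command = Callable[[Machine], None]   # staged command


def position(name: str, index: List[str]) -> int:
    return index.index(name) + 1      # 1-based, newest binding first


def eval2(exp: Tuple, index: List[str]) -> Action:
    tag = exp[0]
    if tag == "const":
        n = exp[1]
        return lambda m: n
    if tag == "var":
        loc = position(exp[1], index)  # resolved in the first stage
        return lambda m: m.stack[loc - 1]
    if tag == "minus":
        a, b = eval2(exp[1], index), eval2(exp[2], index)
        return lambda m: a(m) - b(m)
    if tag == "greater":
        a, b = eval2(exp[1], index), eval2(exp[2], index)
        return lambda m: 1 if a(m) > b(m) else 0
    raise ValueError(tag)


def interpret2(stmt: Tuple, index: List[str]) -> Command:
    tag = stmt[0]
    if tag == "assign":
        loc, e = position(stmt[1], index), eval2(stmt[2], index)
        def assign(m): m.stack[loc - 1] = e(m)
        return assign
    if tag == "seq":
        s1, s2 = interpret2(stmt[1], index), interpret2(stmt[2], index)
        def seq(m): s1(m); s2(m)
        return seq
    if tag == "while":
        # The body is compiled exactly once; only `loop` repeats.
        test, body = eval2(stmt[1], index), interpret2(stmt[2], index)
        def loop(m):
            while test(m) != 0:
                body(m)
        return loop
    if tag == "declare":
        e = eval2(stmt[2], index)
        body = interpret2(stmt[3], [stmt[1]] + index)
        def declare(m):
            m.stack.insert(0, e(m))   # push
            body(m)
            m.stack.pop(0)            # pop
        return declare
    if tag == "print":
        e = eval2(stmt[1], index)
        def emit(m): m.out.append(e(m))
        return emit
    raise ValueError(tag)


# declare x = 3 in while x > 0 do { print x; x := x - 1 }
prog = ("declare", "x", ("const", 3),
        ("while", ("greater", ("var", "x"), ("const", 0)),
         ("seq", ("print", ("var", "x")),
          ("assign", "x", ("minus", ("var", "x"), ("const", 1))))))
compiled = interpret2(prog, [])       # first stage
machine = Machine()
compiled(machine)                     # second stage
```

After the second stage runs, machine.out holds the printed values 3, 2, 1 and the stack is empty again.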

4.4.1  An  example. 

The function interpret2 generates a piece of code from a Com object. To illustrate this we apply
it to the simple program: declare x = 10 in { x := x - 1; print x } and obtain:

<Do %mswo
  { a <- Return %mswo 10
  ; %push a
  ; Do %mswo
      { e <- Do %mswo
               { d <- Do %mswo
                        { b <- %read 1
                        ; c <- Return %mswo 1
                        ; Return %mswo b %- c
                        }
               ; %write 1 d
               }
      ; g <- Do %mswo
               { f <- %read 1
               ; %output f
               }
      ; Return %mswo ()
      }
  ; %pop
  }>

Note that the staged program is essentially a compiler: it translates the syntactic representation
of the while-program into the above monadic object-program that will compute its meaning. In
the object-program all of the compile-time operations have disappeared. This object-program is
fully executable: using the run operator of MetaML, it can be executed directly for
prototyping purposes.

5 Step 3: Back-end translation and intermediate code optimization

MetaML is a meta-programming system. It has an object language and a meta-language.
Meta-programs are programs that manipulate object programs. In MetaML both the object language
and  the  meta-language  are  ML.  In  MetaML  an  object-program  is  both  a  data  structure  that  can 
be  manipulated,  and  a  program  that  can  be  run. 

This  duality  plays  an  important  role  in  target  code  generation.  The  result  of  applying  the 
staged  interpreter  from  the  previous  step  (a  meta-program)  to  a  DSL  program  to  be  compiled  is  a 
highly  constrained  residual  program  (an  object  program).  This  program  is  both  a  data-structure 
and  a  program,  so  it  can  be  both  directly  executed  (rapid  prototype)  and  analyzed. 

We use the object-code analysis capabilities of MetaML to transform the object program into
the final target language. This analysis can include either source-to-source transformations or
translation into another form (e.g. intermediate code, assembly language, or target language).

Control  over  the  form  of  the  residual  program  is  crucial  here.  The  residual  program  is  always 
an  ML  program  (ML  is  the  object  language).  But  the  user  can  control  the  form  of  this  ML 
program.  A  goal  of  the  translation  is  to  make  the  object  program  use  only  those  ML  features 
directly  supported  by  the  target  language.  For  example,  we  may  structure  the  staged  interpreter 
such  that  the  residual  program  is  first  order,  or  just  a  sequence  of  primitive  actions  encoded  as 
non-standard  morphisms  in  the  monad.  This  is  where  we  connect  the  abstract  monadic  actions  to 
their  efficient  implementations. 

The object program produced above is an ML code fragment. It can be executed or analyzed.
The code produced by interpret2 is a restricted subset of ML. Disregarding the higher-order
functions implicit in the monad, it is first order, and contains only Do expressions, Return
expressions, if expressions, calls to the non-standard morphisms read, write, push, pop, and output,
primitive arithmetic operators - and '>', and local looping functions (like loop above). The code
is so regular that it can be captured by a simple grammar. The next step is to analyze this code
to make the final translation to the target language, or to apply some ML-source to ML-source
level optimizations. The reader might notice that the object-program above could be considerably
further simplified by applying the monad laws. There are many opportunities for doing so. After
these laws are applied we obtain the much more satisfying:

<Do %mswo
  { %push 10
  ; a <- %read 1
  ; b <- Return %mswo a %- 1
  ; c <- %write 1 b
  ; d <- %read 1
  ; e <- %output d
  ; Return %mswo ()
  ; %pop
  }>

In  addition  to  the  monad  laws  which  hold  for  all  monads,  we  can  also  use  laws  which  hold  for 
particular  non-standard  morphisms.  For  instance,  in  the  example  above,  we  could  avoid  the  second 
read  of  location  1  using  the  following  rule: 

Do { e1; c <- %write 1 b; d <- %read 1; e2 } = Do { e1; c <- %write 1 b; e2[b/d] }

Every  target  language  will  have  many  such  laws,  and  because  our  target  language  is  both 
executable-code,  and  data-structure  we  can  perform  these  optimizations.  The  final  step  is  to 
translate  the  ML  code  fragment  into  the  target  language.  This  step  uses  the  same  intensional 
analysis  of  code  capabilities  of  the  optimization  steps,  and  is  the  subject  of  the  next  section. 
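As a rough illustration of such a target-specific law, the Python sketch below applies the write-then-read rule to a straight-line program. The encoding is hypothetical: each instruction is a tuple (binder, op, args...), binders are assumed to be unique names, and the substitution e2[b/d] is realized by renaming later uses of the eliminated binder.

```python
from typing import Dict, List, Tuple

Instr = Tuple  # (binder, op, *args); binder is None for unbound actions


def forward_write_to_read(prog: List[Instr]) -> List[Instr]:
    """Apply  c <- write l b ; d <- read l ; e2  ==>  c <- write l b ; e2[b/d]."""
    out: List[Instr] = []
    rename: Dict[str, object] = {}    # eliminated binder -> value it stood for
    i = 0
    while i < len(prog):
        binder, op, *args = prog[i]
        args = [rename.get(a, a) for a in args]   # carry out e2[b/d]
        out.append((binder, op, *args))
        if op == "write" and i + 1 < len(prog):
            nxt_binder, nxt_op, *nxt_args = prog[i + 1]
            if nxt_op == "read" and nxt_args == [args[0]]:
                rename[nxt_binder] = args[1]      # d now denotes b
                i += 1                            # drop the redundant read
        i += 1
    return out


# The simplified object program from the text, in the tuple encoding.
program = [
    (None, "push", 10),
    ("a", "read", 1),
    ("b", "sub", "a", 1),
    ("c", "write", 1, "b"),
    ("d", "read", 1),
    ("e", "output", "d"),
    (None, "pop"),
]
optimized = forward_write_to_read(program)
```

In optimized the second read of location 1 is gone and the output instruction refers to b directly.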

5.1  Intensional  analysis  of  code  fragments 

In  this  section,  we  outline  how  we  do  intensional  analysis  of  residual  code.  We  provide  a  high-level 
pattern  matching  based  interface.  Code  patterns  can  be  constructed  by  placing  brackets  around 
code.  For  example  a  pattern  that  matches  the  literal  5  can  be  constructed  by: 

-| fun is5 <5> = true
     | is5 _ = false;
val is5 = fn : <int> -> bool

-| is5 (lift (1+4));
val it = true : bool

-| is5 <0>;
val it = false : bool

The function is5 matches its argument against the constant pattern <5>; if the match succeeds it
returns true, otherwise false. Pattern variables in code patterns are indicated by escaping variables
in the code pattern.

-| fun parts < ~x + ~y > = SOME(x,y)
     | parts _ = NONE;
val parts = fn : <int> -> (<int> * <int>) option

-| parts <6 + 7>;
val it = SOME (<6>,<7>) : (<int> * <int>) option

-| parts <2>;
val it = NONE : (<int> * <int>) option

The function parts matches its argument against the pattern < ~x + ~y >. If its argument is a
piece of code which is the sum of two subterms, it binds the pattern variable x to the left subterm
and the pattern variable y to the right subterm.
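Readers without access to MetaML can experiment with the same idea using the abstract syntax trees of a host language. The sketch below is an analogy, not MetaML: it uses Python's standard ast module, treats a string of source text as the code value, and relies on ast.unparse, which is available from Python 3.9 onward.

```python
import ast
from typing import Optional, Tuple


def parts(code: str) -> Optional[Tuple[str, str]]:
    """Return the two operands if `code` is a sum, otherwise None."""
    tree = ast.parse(code, mode="eval").body
    if isinstance(tree, ast.BinOp) and isinstance(tree.op, ast.Add):
        # The operands play the role of the pattern variables x and y.
        return ast.unparse(tree.left), ast.unparse(tree.right)
    return None


matched = parts("6 + 7")
unmatched = parts("2")
```

As in the MetaML session, the sum is taken apart into its two subterms and the literal 2 does not match.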

We use higher-order pattern variables [22, 21] for code patterns that contain binding
occurrences, such as lambda expressions, let expressions, do expressions, or functions.

For example, a higher-order pattern that matches the code of a function <fn x => ...>, of type
<'a -> 'b>, is written in eta-expanded form <fn x => ~(g <x>)>. When the pattern matches,
the matching binds the higher-order pattern variable g to a function with type <'a> -> <'b>.
Every higher-order pattern variable must be in fully saturated form, by applying it to all the
bound variables of the code pattern. For example, if g is a higher-order pattern variable with type
<'a> -> <'b> -> <'c> then we must write (~g <x> <y>). The arguments to the higher-order
pattern variable must be explicit bracketed variables, one for each variable bound in the code
pattern at the context where the higher-order pattern appears. A higher-order pattern variable is
used like a function on the right-hand side of a matching construct.

For example, functions which implement the three monad axioms are written as follows:

fun monad1 <do mswo { x <- return mswo ~e; ~(z <x>) }> = z e

fun monad2 <do mswo { x <- ~m; return x }> = m

fun monad3 <do mswo { x <- do mswo { y <- ~a; ~(b <y>) }; ~(c <x>) }> =
    <do mswo { y' <- ~a; do mswo { z <- ~(b <y'>); ~(c <z>) }}>

When the function monad1 is applied to the code <do mswo {a <- return mswo (g 3);
h(a + 2)}>, the pattern variable e is bound to the code <g 3> and the higher-order pattern
variable z is bound to the function fn x => <h(~x + 2)>, which has the type <int> -> <int M>.
The right-hand side of monad1 rebuilds a new code fragment, substituting <g 3> for the formal
parameter x of z, constructing the code <h((g 3) + 2)>.

This  technique  can  be  used  to  build  optimizations,  or  to  translate  a  residual  program  into  a 
target  language. 
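The role of the higher-order pattern variable z in monad1 can be imitated with higher-order abstract syntax. In the Python sketch below, which is an illustration under our own encoding rather than MetaML's matching mechanism, the remainder of a do block is stored as a function from the code of the bound variable to code. The class names Return, Bind, and Prim are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Union


@dataclass
class Return:
    expr: str                       # code of a pure expression


@dataclass
class Prim:
    text: str                       # an opaque action, kept as text


@dataclass
class Bind:
    comp: "Comp"                    # the computation m in  x <- m
    body: Callable[[str], "Comp"]   # rest of the block as a function of x


Comp = Union[Return, Prim, Bind]


def monad1(c: Comp) -> Comp:
    """Left identity:  do { x <- return e; z x }  =  z e."""
    if isinstance(c, Bind) and isinstance(c.comp, Return):
        # c.body corresponds to z and c.comp.expr to e.
        return c.body(c.comp.expr)
    return c


# do { a <- return (g 3); h(a + 2) }
before = Bind(Return("(g 3)"), lambda a: Prim(f"h({a} + 2)"))
after = monad1(before)
```

Applying monad1 yields the action h((g 3) + 2), matching the worked example in the text; a block that does not begin with a return is left unchanged.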

6  Conclusion 

The important issues of efficient language implementation by refinement from high-level
specifications are: the efficient use of the underlying target environment, and removing the layer of
interpretative computation introduced by such specifications. We have shown that monads and
staging are the right abstraction mechanisms to accomplish the task. To effectively use these tools
we propose that DSL implementers follow a well-defined method. We reiterate our method here:

•  Domain  analysis.  The  problem  domain  is  analyzed  to  find  the  common  abstractions  around 
which  the  language  is  designed.  This  step  is  perhaps  the  most  important  step  in  a  good 
language  design.  It  has  been  studied  extensively  by  others  [32,  2,  3].  Our  research  group 
has  been  investigating  the  integration  of  DSL  design  and  domain  analysis  for  several  years. 
Recently Widen and Hook have summarized a “top level” view of this integration, which is
called  the  Software  Design  Automation  (SDA)  method  [33].  This  method  provides  a  design 
process  and  many  synthesis  techniques  to  facilitate  the  integration  of  traditional  domain 
analysis  activities  with  language  design  and  implementation.  The  method  we  propose  can  be 
used  in  the  context  of  SDA.  It  specifically  addresses  the  language  implementation  phase  of 
the  process. 

•  Definitional  interpreter.  Once  the  language  has  been  identified,  the  next  step  is  to  provide 
it  with  a  semantics  given  as  a  pure  functional  interpreter.  This  program  can  be  thought  of  as 
its high-level definition [14, 25]. High-level interpreters are usually easy to construct and
provide a reference which can be consulted to resolve any ambiguity in the language specification
discovered  in  further  steps.  By  building  it  in  an  executable  framework  (a  functional  language, 
such  as  Haskell  or  ML)  it  also  provides  a  rapid  prototype  against  which  expectations  can  be 
measured. 

• Binding time improvements. The next step requires a binding separation [8]. By
identifying compile-time versus run-time data structures in the definitional interpreter, we can
separate those with both components into separate data-structures. Examples of binding
time improvements include the separation of environments, which map names to values, into
a compile-time index and a run-time stack, and the introduction of a local recursive
function to separate the recursion which drives the analysis of the syntax of the program being
interpreted from the recursion that encodes the looping of the while command.

•  Target  domain  analysis.  The  next  step  is  to  analyze  the  target  language  to  identify  the 
primitive  implementation  features  that  will  support  the  translation.  This  step  is  usually 
straight-forward  as  the  target  language  is  often  fixed,  and  well  understood. 

•  Design  a  monad.  The  next  step  is  to  design  a  monad  to  capture  the  effects  and  actions 
implicit  in  the  target  language.  This  is  a  hard  step  in  the  process  since  it  requires  both  abstract 
knowledge  about  the  structure  and  properties  of  monads,  and  detailed  concrete  knowledge 
about  the  target  domain.  The  choices  made  in  this  step  influence  the  structure  of  the  monad, 
the  structure  of  the  monadic  interpreter,  and  the  run-time  system  which  interacts  with  the 
low-level  effects  of  the  target  language. 

Once  the  monad  is  designed,  an  implementation  for  the  monad  as  a  pure  functional  emulation 
must  be  produced.  The  implementation  must  emulate  the  actions  in  a  purely  functional 
setting  by  explicitly  threading  abstract  representations  of  the  actions  such  as  “stores”,  “I/O 
streams”,  or  “exception  continuations”  in  and  out  of  all  computations. 

• Monadic Interpreter. The next step is to refine the purely functional definitional
interpreter into one written in a monadic style [28, 24, 13]. This implementation is still purely
functional because the actions of the monad are emulated in a functional style. But because
the actions are now explicit, we have moved the form of definition closer to the target
language. This step often requires a big change to the structure of the source code, because the
monad makes implicit much of the “plumbing” that was explicit in the interpreter. This
restructuring has a cost, but it is not without benefit: the removal of the explicit plumbing
results in programs which are simpler, and more immune to future changes.

• Staging. The next step completes the binding-time separation begun in the binding time
improvement step. That step separated the compile-time data from the run-time data.
Staging separates the compile-time computations from the run-time computations. This is done
by placing explicit staging annotations in the program written in MetaML. Staging is the
crucial step that differentiates an (inefficient) interpreter from an (efficient) compiler.

•  Transformation  of  residual  code. 

The  residual  object-program  produced  by  a  staged  interpreter  is  both  a  data  structure  that  can 
be  manipulated,  and  a  program  that  can  be  run.  Control  over  the  form  of  the  residual  program 
is  crucial  here.  The  residual  program  is  always  an  ML  program  (ML  is  the  object  language). 
But  the  user  can  control  the  form  of  this  ML  program.  A  goal  of  the  translation  is  to  make 
the  object  program  use  only  those  ML  features  directly  supported  by  the  target  language. 
The restricted form of the residual object program makes it possible to use the intensional
analysis  of  object-code  tools  provided  by  MetaML  to  easily  build  the  final  translation  step  to 
the  target  language. 
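The "Design a monad" step asks for a pure functional emulation that threads the effects explicitly. A minimal sketch of what such an emulation might look like is given below in Python. It is not the paper's mswo monad: the representation of a computation as a function from state to a (result, state) pair, the choice of immutable tuples for the stack and the output stream, and the 1-based locations are all assumptions made for illustration.

```python
# A computation is a function: state -> (result, new_state),
# where state = (stack, output) and both components are tuples.

def unit(a):
    """Return a without touching the state."""
    return lambda s: (a, s)


def bind(m, k):
    """Run m, then feed its result to k, threading the state through."""
    def run(s):
        a, s1 = m(s)
        return k(a)(s1)
    return run


def push(v):
    return lambda s: (None, ((v,) + s[0], s[1]))


def pop(s):
    return (None, (s[0][1:], s[1]))


def read(loc):
    return lambda s: (s[0][loc - 1], s)


def write(loc, v):
    return lambda s: (None, (s[0][:loc - 1] + (v,) + s[0][loc:], s[1]))


def output(v):
    return lambda s: (None, (s[0], s[1] + (v,)))


# declare x = 10 in { x := x - 1; print x }
program = bind(push(10), lambda _:
          bind(read(1), lambda a:
          bind(write(1, a - 1), lambda _:
          bind(read(1), lambda d:
          bind(output(d), lambda _:
          pop)))))

result, (stack, out) = program(((), ()))
```

Running the program from an empty state leaves the stack empty and the output stream holding the single value 9. Because every action is an ordinary function, the same interpreter text can later be rebound to an efficient, effectful implementation of the same operations.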

6.1  Benefits  of  the  approach 

This  paper  illustrated  a  step  by  step  method  for  constructing  correct  and  efficient  implementations 
of  DSLs.  The  method  has  the  following  advantages  over  building  a  DSL  implementation  in  an 
ad-hoc  fashion. 

• Simplicity. We divide the task of DSL implementation into small manageable tasks.
The compiler is constructed by a method of refinement, and we use special abstraction
mechanisms so that each step addresses only a single aspect of the compiler.

• Reuse. Our method provides many opportunities for reuse. By using the abstraction
methods of monads and staging, much of the code remains unchanged between refinement steps.
In addition, monad implementations are reusable across DSLs, and multiple DSLs using the
same target language can reuse the intensional analysis.

• Control. Instead of using a fixed set of techniques or tools to generate compilers, we outline a
method which provides users control over each step. A good impedance match between
low-level features of the target language and the high-level DSL is necessary for good performance.
Since every compiler is different, users need such fine-grained control.

• Correctness. The MetaML type system provides major support for ensuring the
correctness of the compilers generated. It is simply not possible to write a type-incorrect translation.
But type-correctness is not enough. We wish to prove other correctness properties as well,
such as the equivalence between the artifacts produced by each step of the method. We
believe that it is possible for each step to make explicit its proof obligations, and because each
step produces a functional program, it is possible to use equational reasoning to prove these
obligations.

6.2  The  Implementation 

Everything  you  have  seen  in  this  paper,  except  the  higher  order  pattern  matching  over  code,  has 
been  implemented  in  the  MetaML  implementation.  The  examples  are  actual  runs  of  the  system. 

The  higher  order  pattern  matching  is  currently  under  development.  We  found  the  normalizing 
effect  of  the  monad  laws  so  compelling  that  we  implemented  them  in  an  ad-hoc  fashion  inside  the 
MetaML  system. 
Abstract. We introduce a technique to facilitate termination proofs for term rewriting systems. We
especially focus on innermost termination. The main features of this technique lie in its simplicity
and effectiveness in practice. This work can be regarded as an application of the general notion of
termination through transformation to both termination and innermost termination proofs.


1  Introduction 

It  is  a  highly  significant  question  to  determine  whether  a  term  rewriting  system  (TRS)  is  terminating. 
In  theorem  proving,  TRSs  are  widely  used  for  a  variety  of  purposes.  For  instance,  it  is  often  desirable  to 
transform  a  set  of  equality  rules  into  a  TRS  in  order  to  reduce  the  search  space.  Also  TRSs  can  be  used 
for  proving  the  termination  of  both  functional  and  logic  programs. 

Though  termination  is  an  undecidable  property  of  TRSs  in  general,  there  have  been  many  techniques 
developed  for  facilitating  termination  proofs.  Some  surveys  are  given  in  [Der87,Ste95b].  As  mentioned  in 
[MOZ96],  techniques  for  termination  proofs  can  be  generally  classified  into  two  categories. 

-  Basic  techniques  such  as  various  path  orderings  [Pla78,KL80,Der82],  Knuth-Bendix  ordering  [KB70], 
and  polynomial  interpretations  [Lan79,BL87]  that  apply  directly  to  a  TRS. 

-  Transformational approaches which in general transform a TRS into another TRS such that the
termination of the latter implies that of the former and the latter can be proven terminating more easily.
For instance, transformation orderings [BL90,Ste95a], semantic labelling [Zan95] and freezing [Xi98]
belong to this category. Also the dependency pair approach [AG97,AG98] can be loosely classified
into this category since it transforms a TRS into a set of dependency pairs.

There  are  also  various  results  on  modular  termination,  which  basically  give  the  sufficient  conditions  on 
two  terminating  TRSs  that  imply  the  termination  of  their  union.  The  importance  of  modularity  results 
is  evident.  It  is  often  true  that  new  TRSs  are  formed  on  top  of  existing  TRSs.  With  modularity  results, 
it  is  possible  to  reduce  the  termination  of  new  TRSs  to  that  of  the  existing  ones.  In  this  paper,  we 
adopt  a  transformational  approach  for  establishing  some  results  on  modular  termination  and  innermost 
termination. Given a TRS $\mathcal{R}$, we intend to split $\mathcal{R}$ into the union of $\mathcal{R}_1$ and $\mathcal{R}_2$, and then prove that the
(innermost) termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$ under some conditions.

We  say  that  a  TRS  is  innermost  terminating  if  there  is  no  infinite  innermost  rewriting  sequence  in 
this  TRS.  Roughly  speaking,  innermost  rewriting  means  that  we  can  rewrite  a  term  only  if  all  of  its 
proper  subterms  are  in  normal  form.  To  some  extent,  innermost  rewriting  can  model  the  notion  of  call- 
by-value evaluation in functional programming, though there are usually some special rules for handling
conditionals. Also it is proven in [AZ95] that the innermost termination of the TRS transformed from a
logic program implies the termination of the logic program. Therefore, the study of innermost termination
is of significant relevance to the study of termination of functional and logic programs. Moreover, there
are also various results which relate innermost termination to termination [Gra95]. This allows us to
reduce termination to innermost termination for some TRSs, where the latter is often easier to prove.

*  Partially supported by the United States Air Force Materiel Command (F19 628-96-C-0161) and the Department
of Defense.

We now present an example to illustrate the erasure technique before going into further details.
Hierarchical combinations of TRSs arise frequently when we transform functional programs into
TRSs, for the simple reason that defined functions are used to define new functions. For instance, the
following function purge defined in ML [MTHM97] removes all duplicates from a given (integer) list while
the function remove deletes all the elements equal to some given value.

fun remove(x, nil) = nil
  | remove(x, cons(y, ys)) =
      if x = y then remove(x, ys) else cons(y, remove(x, ys))
fun purge(nil) = nil
  | purge(cons(x, xs)) = cons(x, purge(remove(x, xs)))

When  proving  termination  of  such  a  functional  program,  the  following  aspect  must  be  taken  into  consid¬ 
eration: 

Usually,  the  programmer  applies  a  semantic  argument  such  as  a  measure  function  in  order  to  show 
that  the  defined  function  is  terminating.  For  example,  the  function  purge  is  terminating  because 
the  length  of  the  list  remove(x,ys)  is  not  greater  than  that  of  ys.  Note  that  it  is  in  general  an 
exceedingly  difficult  task  to  synthesize  such  a  measure  function  from  the  structure  of  a  program. 
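The semantic argument can be checked experimentally. The Python transcription below is a sketch of the two ML functions on built-in lists, with an assertion added (our addition, not part of the original program) that records the measure on which the termination argument rests: the recursive call of purge always receives a list no longer than the tail.

```python
from typing import List


def remove(x: int, ys: List[int]) -> List[int]:
    """Delete every element equal to x."""
    return [y for y in ys if y != x]


def purge(xs: List[int]) -> List[int]:
    """Remove all duplicates, keeping first occurrences."""
    if not xs:
        return []
    head, tail = xs[0], xs[1:]
    rest = remove(head, tail)
    # The measure used in the semantic termination argument.
    assert len(rest) <= len(tail)
    return [head] + purge(rest)


deduplicated = purge([1, 2, 1, 3, 2])
```

The assertion holds on every call, but discovering such a measure automatically from the program text alone is the hard part.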

The program can be transformed into the following TRS $\mathcal{R}_{pg}$ (see footnote 1).

(1)  remove(x, nil) → nil

(2)  remove(x, cons(y, ys)) → if(x = y, remove(x, ys), cons(y, remove(x, ys)))

(3)  purge(nil) → nil

(4)  purge(cons(x, xs)) → cons(x, purge(remove(x, xs)))

It seems difficult to prove the termination of this TRS with a syntactic approach. We can transform this
TRS into the following TRS $\mathcal{R}'_{pg}$ with the erasure technique (ET) (see footnote 2).

(1')    nil → nil

(2.1')  cons(y, ys) → ys

(2.2')  cons(y, ys) → cons(y, ys)

(3')    purge(nil) → nil

(4')    purge(cons(x, xs)) → cons(x, purge(xs))

In this case, we project a term beginning with remove to the second argument of remove and a term
beginning with if to either the second or the third argument of if. Under the recursive path ordering
RPO with the precedence purge ≻ cons, the rules (2.1'), (3') and (4') can be strictly ordered and the rules
(1') and (2.2') can be ordered. We now informally argue that $\mathcal{R}_{pg}$ is innermost terminating. Suppose
that there is an infinite innermost $\mathcal{R}_{pg}$-rewriting sequence. We will show that this sequence induces an
infinite rewriting sequence in the transformed TRS $\mathcal{R}'_{pg}$. We then observe that this induced sequence
cannot have infinitely many applications of those strictly ordered rules. Therefore, there is an infinite
$\mathcal{R}'_{pg}$-rewriting sequence in which the only applied rules are (1') and (2.2'). We will then prove this
implies that there is an infinite innermost $\mathcal{R}_{pg}$-rewriting sequence in which the only applied rules are
(1) and (2). This is a contradiction since the TRS consisting of rules (1) and (2) is easily proven to be
terminating. Therefore, we conclude that $\mathcal{R}_{pg}$ is innermost terminating. This argument will be
substantiated in Section 3.

1  We omit the rules involving = and if at this moment.

2  The following is slightly different from the actual application of ET for the purpose of presentation.
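The projection used in this example is mechanical, which the following Python sketch tries to make visible. It follows the simplified presentation above rather than the actual definition of ET given later, and the term encoding (variables as strings, applications as tuples headed by the function symbol) is our own assumption. Because a term headed by if may be projected to either branch, the function returns every possible result.

```python
from itertools import product
from typing import List, Tuple, Union

Term = Union[str, Tuple]   # a variable name, or (symbol, arg1, ..., argn)


def erasures(t: Term) -> List[Term]:
    """All terms obtained by projecting `remove` to its second argument
    and `if` to its second or third argument."""
    if isinstance(t, str):
        return [t]
    if t[0] == "remove":
        return erasures(t[2])
    if t[0] == "if":
        return erasures(t[2]) + erasures(t[3])
    choices = [erasures(a) for a in t[1:]]
    return [(t[0],) + combo for combo in product(*choices)]


# Rule (2): remove(x, cons(y, ys)) -> if(x = y, remove(x, ys), cons(y, remove(x, ys)))
lhs2 = ("remove", "x", ("cons", "y", "ys"))
rhs2 = ("if", ("=", "x", "y"),
        ("remove", "x", "ys"),
        ("cons", "y", ("remove", "x", "ys")))

# Rule (4): purge(cons(x, xs)) -> cons(x, purge(remove(x, xs)))
rhs4 = ("cons", "x", ("purge", ("remove", "x", "xs")))

lhs2_erased = erasures(lhs2)
rhs2_erased = erasures(rhs2)
rhs4_erased = erasures(rhs4)
```

The left-hand side of rule (2) erases to cons(y, ys), and its right-hand side to ys or cons(y, ys), which gives the two rules (2.1') and (2.2'); the right-hand side of rule (4) erases to cons(x, purge(xs)), as in rule (4').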

As already mentioned, most programmers use semantic arguments to prove termination. This
is a powerful and flexible approach but it is also too semantic to be largely automated. On the other
hand, the limited erasure technique is syntactic, and thus it is reasonable to expect that this approach
can be combined with other approaches such as the freezing technique to facilitate automatic innermost
termination proofs. However, we observe in practice [SX98] that it is questionable whether even an
approach as simple as RPOS scales, not to mention other more involved techniques. Therefore, we expect that
a more promising direction is to apply the erasure technique interactively. We shall make this point
clearer with concrete examples.

This  paper  is  organized  as  follows.  In  Section  2,  we  briefly  explain  the  notations  and  introduce  some 
basic  concepts.  We  present  the  erasure  technique  (ET)  for  innermost  termination  proofs  in  Section  3  and 
establish  the  correctness  of  ET.  This  section  constitutes  the  main  contribution  of  the  paper.  We  then 
mention  some  closely  related  work  and  conclude.  We  also  present  some  examples  in  Appendix  A,  which 
can  be  of  some  assistance  for  the  reader  to  understand  the  presented  work  if  necessary. 


2  Preliminaries 

In  general,  we  stick  to  the  notations  in  [DJ91]  though  some  minor  modifications  may  occur.  We  briefly 
summarize  the  notations  and  develop  some  concepts  needed  later. 

2.1  Basics 

We fix a countably infinite set $\mathcal{X}$ of variables $x, y, \ldots$ and use $\mathcal{F}$ for a (finite) set of function symbols
$f, g, \ldots$. Note that every function symbol $f$ is of a fixed arity $Ar(f)$ and $f$ is a constant if $Ar(f) = 0$.
We assume that there is at least one constant in $\mathcal{F}$. Let $\mathcal{T}(\mathcal{F}, \mathcal{X})$ denote the set of terms over $\mathcal{F}$ and $\mathcal{X}$,
and $\mathcal{T}(\mathcal{F})$ the set of ground terms over $\mathcal{F}$. Given a term $t$, $Var(t)$ is the set of variables that occur
in $t$. We use $l \to r$ for a rewrite rule, where we require $Var(r) \subseteq Var(l)$. We use $\sigma$ for substitutions and
$dom(\sigma)$ for its (finite) domain. Also $t\sigma$ stands for the result of applying $\sigma$ to $t$.

Definition  1.  Contexts  C  are  defined  as  follows. 

1.  []  is  a  context,  and 

2.  $f(t_1, \ldots, t_{i-1}, C, t_{i+1}, \ldots, t_n)$ is a context if $Ar(f) = n$ and $C$ is a context.

C[t]  is  the  term  obtained  from  replacing  the  hole  []  in  C  with  term  t. 

A TRS $\mathcal{R}$ over $\mathcal{F}$ is a set of rewrite rules over $\mathcal{T}(\mathcal{F}, \mathcal{X})$. A function symbol $f$ is an $\mathcal{R}$-defined function
if $f$ is the root symbol of $l$ for some rewrite rule $l \to r$ in $\mathcal{R}$, and $f$ is an $\mathcal{R}$-constructor if it is not an
$\mathcal{R}$-defined function. We often use $c$ for constructors.

Given a TRS $\mathcal{R}$, we write $t_1 \to_{\mathcal{R}} t_2$ if $t_1 = C[l\sigma]$ and $t_2 = C[r\sigma]$ and $l \to r$ is a rewrite rule in $\mathcal{R}$, and
we may also write $t_1 \to_{\mathcal{R}} t_2 / \langle C, l \to r, \sigma \rangle$ to make this explicit. A term $t$ is in $\mathcal{R}$-normal form if there
exists no $t'$ such that $t \to_{\mathcal{R}} t'$ holds. If $l \to r \in \mathcal{R}$ and all proper subterms of $l\sigma$ are in $\mathcal{R}$-normal form, we
say $t_1 = C[l\sigma]$ rewrites to $t_2 = C[r\sigma]$ through innermost rewriting, and we use $\to^{i}_{\mathcal{R}}$ for such a rewriting
relation. Also we use $t \to^{0/1} t'$ to mean that either $t = t'$ or $t \to t'$.
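These definitions can be made concrete with a small interpreter. The Python sketch below is ours and not part of the formal development: variables are strings, every other term is a tuple headed by its function symbol, and the rules for eq and if over two constants a and b are hypothetical stand-ins for the rules omitted in the introductory example. A root redex is contracted only when no argument can be rewritten, which is exactly the innermost condition that all proper subterms of the redex are in normal form.

```python
from typing import Dict, List, Optional, Tuple, Union

Term = Union[str, Tuple]          # variable name, or (symbol, arg1, ..., argn)
Rule = Tuple[Term, Term]


def match(pat: Term, t: Term, sub: Dict[str, Term]) -> Optional[Dict[str, Term]]:
    """Extend `sub` so that pat instantiated by it equals t, else None."""
    if isinstance(pat, str):
        if pat in sub:
            return sub if sub[pat] == t else None
        return {**sub, pat: t}
    if isinstance(t, str) or pat[0] != t[0] or len(pat) != len(t):
        return None
    for p, s in zip(pat[1:], t[1:]):
        sub = match(p, s, sub)
        if sub is None:
            return None
    return sub


def substitute(t: Term, sub: Dict[str, Term]) -> Term:
    if isinstance(t, str):
        return sub.get(t, t)
    return (t[0],) + tuple(substitute(a, sub) for a in t[1:])


def innermost_step(t: Term, rules: List[Rule]) -> Optional[Term]:
    """One innermost rewriting step, or None if t is in normal form."""
    if isinstance(t, str):
        return None
    for i, arg in enumerate(t[1:], start=1):
        new = innermost_step(arg, rules)
        if new is not None:
            return t[:i] + (new,) + t[i + 1:]
    for lhs, rhs in rules:
        sub = match(lhs, t, {})
        if sub is not None:
            return substitute(rhs, sub)
    return None


def normalize(t: Term, rules: List[Rule], fuel: int = 10_000) -> Term:
    """Rewrite innermost until a normal form is reached (bounded by fuel)."""
    while fuel > 0:
        nxt = innermost_step(t, rules)
        if nxt is None:
            return t
        t, fuel = nxt, fuel - 1
    raise RuntimeError("no normal form within the step budget")


nil, a, b = ("nil",), ("a",), ("b",)
true, false = ("true",), ("false",)


def cons(h: Term, t: Term) -> Term:
    return ("cons", h, t)


RULES: List[Rule] = [
    (("remove", "x", nil), nil),
    (("remove", "x", cons("y", "ys")),
     ("if", ("eq", "x", "y"), ("remove", "x", "ys"),
      cons("y", ("remove", "x", "ys")))),
    (("purge", nil), nil),
    (("purge", cons("x", "xs")), cons("x", ("purge", ("remove", "x", "xs")))),
    # Hypothetical rules for equality and conditionals.
    (("if", true, "x", "y"), "x"),
    (("if", false, "x", "y"), "y"),
    (("eq", a, a), true), (("eq", a, b), false),
    (("eq", b, a), false), (("eq", b, b), true),
]

normal_form = normalize(("purge", cons(a, cons(b, cons(a, nil)))), RULES)
```

The term purge(cons(a, cons(b, cons(a, nil)))) normalizes to cons(a, cons(b, nil)). The fuel bound is a practical guard only; it says nothing about termination in general.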

We use $\to^*$ for the transitive and reflexive closure of a relation $\to$. $\mathcal{R}$ is (innermost) terminating if
there exists no infinite (innermost) $\mathcal{R}$-rewriting sequence. Given a substitution $\sigma$, $\sigma$ is $\mathcal{R}$-normal if $\sigma(x)$
is in $\mathcal{R}$-normal form for every $x \in dom(\sigma)$. The following definition is less standard.

Definition 2. Given a term $t$, $t$ is skeleton $\mathcal{R}$-normal if we always obtain terms in $\mathcal{R}$-normal form by
replacing occurrences of variables in $t$ with terms in $\mathcal{R}$-normal form. Note that we do not have to replace
occurrences of the same variable with the same terms. Similarly, $t$ is skeleton $\mathcal{R}$-terminating if we always
obtain $\mathcal{R}$-terminating terms by replacing occurrences of variables in $t$ with $\mathcal{R}$-terminating terms.




We have the following limited method to construct skeleton $\mathcal{R}$-normal terms.

Proposition 1. Let $\mathcal{R}$ be a TRS.

1. Every variable is skeleton $\mathcal{R}$-normal.

2. $c(t_1, \ldots, t_n)$ is skeleton $\mathcal{R}$-normal if $c$ is an $\mathcal{R}$-constructor and $t_i$ are skeleton
$\mathcal{R}$-normal for $i = 1, \ldots, n$.

Proof This is straightforward by the definition. ■

In other words, $\mathcal{R}$-constructor terms, that is, terms constructed from $\mathcal{R}$-constructors and variables, are
skeleton $\mathcal{R}$-normal. Similarly, $\mathcal{R}$-constructor terms are also $\mathcal{R}$-terminating.
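Proposition 1 yields a simple syntactic sufficient check. The sketch below is illustrative only: it assumes
the same hypothetical tuple encoding of terms (a variable is a string, an application is a tuple whose first
component is the function symbol), and the function names are ours. It tests whether a term is built solely
from constructors and variables, which suffices for (but is not equivalent to) skeleton normality.

```python
def defined_symbols(rules):
    """The R-defined functions: root symbols of the left-hand sides."""
    return {lhs[0] for lhs, _ in rules}

def is_constructor_term(term, defined):
    """True if term contains only variables and symbols outside `defined`."""
    if isinstance(term, str):  # a variable
        return True
    return term[0] not in defined and all(
        is_constructor_term(arg, defined) for arg in term[1:]
    )

# A one-rule TRS: pred(s(x)) -> x, so pred is defined and s is a constructor.
RULES = [(("pred", ("s", "x")), "x")]
DEFINED = defined_symbols(RULES)
```

Here is_constructor_term(("s", ("s", "x")), DEFINED) holds, whereas a term containing pred is rejected even
though it might still happen to be skeleton normal; the check is deliberately conservative.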

We use the notation $\succeq$ for a quasi ordering and $\succ$ for the strict part of $\succeq$. A reduction ordering is
an ordering $\succeq$ such that its strict part $\succ$ is well-founded and both $\succeq$ and $\succ$ are compatible with the
term structure and stable under substitutions. One of the most well-known and widely used reduction
orderings is the recursive path ordering with status (RPOS) [Der82,KL80]. Please see [Ste95b] for further
details.

Remark 1. We say that a rewrite rule $l \to r$ is strictly ordered under $\succeq$ if $l \succ r$, and $l \to r$ is ordered if
$l \succeq r$.

2.2  Hierarchical  Combination 

Definition 3. Given two TRSs $\mathcal{R}_1$ and $\mathcal{R}_2$, we say $\mathcal{R}_1$ and $\mathcal{R}_2$ form a hierarchical
combination $\mathcal{R}_1 \cup \mathcal{R}_2$ if no defined function symbols in $\mathcal{R}_2$ have appearances in $\mathcal{R}_1$.
Given a term $t$, a subterm of $t$ is called an $\mathcal{R}_2$-subterm if the root symbol of the subterm is an
$\mathcal{R}_2$-defined function symbol.

Notice  that  hierarchical  combination  occurs  naturally  when  we  transform  functional  programs  into  TRSs: 
defined  functions  are  used  to  define  new  functions. 

We  omit  the  proof  of  the  following  lemma  since  it  is  really  a  bit  of  folklore  in  term  rewriting. 

Lemma 1. Suppose that two TRSs $\mathcal{R}_1$ and $\mathcal{R}_2$ form a hierarchical combination $\mathcal{R}$. We have the following.

1. If all $\mathcal{R}_2$-subterms of $t$ are in $\mathcal{R}$-normal form and $t \to_{\mathcal{R}} t'$, then all $\mathcal{R}_2$-subterms of
$t'$ are in $\mathcal{R}$-normal form.

2. If $\mathcal{R}_1$ is terminating and all $\mathcal{R}_2$-subterms of $t$ are in $\mathcal{R}$-normal form, then $t$ is (innermost)
$\mathcal{R}$-terminating, that is, there is no infinite (innermost) $\mathcal{R}$-rewriting sequence from $t$.

In the following presentation, we may omit the prefix “$\mathcal{R}$-” if it is irrelevant or it is clear from the
context which $\mathcal{R}$ we refer to.

2.3  Erasure 

Generally speaking, $t_1$ is an erasure of $t_2$ if $t_1$ can be obtained by erasing some function symbols and
subterms in $t_2$. In other words, $t_1$ embeds into $t_2$. However, it will soon be clear that some embeddings
may not be erasures.

For every function symbol $f$ in $\mathcal{F}$ with arity $n$, we associate with it the following rewrite rules for
$i = 1, \ldots, n$.

\[ (f\text{-}o\text{-}i) \qquad f(x_1, \ldots, x_n) \to f(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \]

\[ (f\text{-}p\text{-}i) \qquad f(x_1, \ldots, x_n) \to x_i \]

An $(f\text{-}o\text{-}i)$ rule is called an omitting rule and an $(f\text{-}p\text{-}i)$ rule a projection rule. Both of these rules
are called erasure rules. Notice that an $(f\text{-}o\text{-}i)$ rule changes the arity of $f$. Also we say that
$(f\text{-}p\text{-}i)$ is not argument-dropping if $Ar(f) = 1$. All other erasure rules are argument-dropping.

Given a set $\mathcal{S}$ of erasure rules in which there is at most one rule associated with $f$ for every $f \in \mathcal{F}$,
we call $\mathcal{S}$ an erasure TRS. The $\mathcal{S}$-erasure of $t$ is the $\mathcal{S}$-normal form of $t$, which is alternatively
defined as follows.
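Definition 4. Given an erasure TRS $\mathcal{S}$, we use $|t|_{\mathcal{S}}$ for the $\mathcal{S}$-erasure of $t$ and
$\epsilon(t)_{\mathcal{S}}$ for the set of terms erased from $t$. In general, we omit the subscript $\mathcal{S}$ if there is no risk
of confusion.

\[
|t| =
\begin{cases}
t & \text{if } t \text{ is a variable;} \\
f(|t_1|, \ldots, |t_{i-1}|, |t_{i+1}|, \ldots, |t_n|) & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}o\text{-}i) \in \mathcal{S}; \\
|t_i| & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}p\text{-}i) \in \mathcal{S}; \\
f(|t_1|, \ldots, |t_n|) & \text{if } t = f(t_1, \ldots, t_n) \text{ and otherwise.}
\end{cases}
\]

\[
\epsilon(t) =
\begin{cases}
\emptyset & \text{if } t \text{ is a variable;} \\
\{t_i\} \cup \bigcup_{j \in \{1, \ldots, n\} \setminus \{i\}} \epsilon(t_j) & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}o\text{-}i) \in \mathcal{S}; \\
\{t_1, \ldots, t_{i-1}, t_{i+1}, \ldots, t_n\} \cup \epsilon(t_i) & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}p\text{-}i) \in \mathcal{S}; \\
\bigcup_{j \in \{1, \ldots, n\}} \epsilon(t_j) & \text{if } t = f(t_1, \ldots, t_n) \text{ and otherwise.}
\end{cases}
\]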

The erasure of a rule $l \to r$ is $|l| \to |r|$, and the erasure of $\mathcal{R}$ is defined analogously. Note that the erasure
of a rewrite rule may not always be a legal rewrite rule. For instance, the $\mathcal{S}$-erasure of $if(false, x, y) \to y$
is $x \to y$ for $\mathcal{S} = \{(if\text{-}p\text{-}2)\}$, which is illegal. Similarly, the erasure of a TRS may not be a legal TRS.
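The two recursive definitions of the erasure and of the erased set translate directly into code. The
following is a minimal sketch under stated assumptions, not the authors' implementation: terms use a
hypothetical tuple encoding (a variable is a string, an application is a tuple headed by its symbol), an
erasure TRS is a dict mapping a symbol to ("o", i) or ("p", i), and the erased set is returned as a list. All
names are ours.

```python
def erase(t, S):
    """|t|: the S-erasure of t, for single-position omitting/projection rules."""
    if isinstance(t, str):
        return t
    f, args = t[0], t[1:]
    kind, i = S.get(f, (None, None))
    if kind == "p":  # (f-p-i): project onto argument i
        return erase(args[i - 1], S)
    kept = [a for j, a in enumerate(args, 1) if not (kind == "o" and j == i)]
    return (f,) + tuple(erase(a, S) for a in kept)

def erased(t, S):
    """epsilon(t): the subterms erased from t, in left-to-right order."""
    if isinstance(t, str):
        return []
    f, args = t[0], t[1:]
    kind, i = S.get(f, (None, None))
    out = []
    for j, a in enumerate(args, 1):
        if (kind == "o" and j == i) or (kind == "p" and j != i):
            out.append(a)  # the whole argument is dropped
        else:
            out.extend(erased(a, S))  # the argument is kept; look inside it
    return out

def variables(t):
    if isinstance(t, str):
        return {t}
    return set().union(*(variables(a) for a in t[1:]))

def is_legal_rule(lhs, rhs):
    """The legality condition used in the paper: Var(r) is a subset of Var(l)."""
    return variables(rhs) <= variables(lhs)

S = {"if": ("p", 2)}
LHS = ("if", ("false",), "x", "y")
```

With these definitions, erase(LHS, S) is the variable x, so the erasure of $if(false, x, y) \to y$ is $x \to y$,
which is_legal_rule rejects, in agreement with the remark above.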

The erasure $|C|$ of a context $C$ can be defined in a straightforward manner. However, $|C|$ may not be
a context since the hole [] in $C$ may be erased away. In this case, we write $|C|[t]$ simply for $|C|$. Given
a substitution $\sigma$, its erasure $|\sigma|$ is a substitution with the same domain and $|\sigma|(x) = |\sigma(x)|$ for every
$x \in dom(\sigma)$.


Proposition 2. Given a context $C$, a term $t$ and a substitution $\sigma$, we have $|C[t]| = |C|[|t|]$ and
$|t\sigma| = |t||\sigma|$.

Proof This is straightforward from a structural induction on $C$ and $t$, respectively. ■

Lemma 2. Suppose that the erasure $|\mathcal{R}|$ of a TRS $\mathcal{R}$ is also a TRS. If $t_1 \to_{\mathcal{R}} t_2$, then
$|t_1| \to^{0/1}_{|\mathcal{R}|} |t_2|$.

Proof Assume $t_1 = C[l\sigma]$ and $t_2 = C[r\sigma]$ for some $\sigma$, where $l \to r \in \mathcal{R}$. If $|C|$ is not a context, then
$|t_1| = |C| = |t_2|$. Otherwise, $|t_1| = |C|[|l||\sigma|]$ and $|t_2| = |C|[|r||\sigma|]$ by Proposition 2. Since
$|l| \to |r| \in |\mathcal{R}|$, we have $|t_1| \to^{0/1}_{|\mathcal{R}|} |t_2|$. Clearly, if $|C|$ is a context, then
$|t_1| \to_{|\mathcal{R}|} |t_2|$. ■

Note that for every $f \in \mathcal{F}$ with arity $n$, we can introduce the following omitting rule, where
$1 \le i_1 < \ldots < i_k \le n$.

\[ (f\text{-}o\text{-}(i_1, \ldots, i_k)) \qquad f(x_1, \ldots, x_n) \to f(x_1, \ldots, x_{i_1 - 1}, x_{i_1 + 1}, \ldots, x_{i_k - 1}, x_{i_k + 1}, \ldots, x_n) \]

In other words, this rule drops the subterms of $f(t_1, \ldots, t_n)$ at the positions $i_1, \ldots, i_k$. This rule is
argument-dropping. Note that this is a single rule, which should not be regarded as a combination of
several omitting rules. Also it should be clear that all the previous results involving erasure still hold in
the presence of such omitting rules.

3  Erasure  for  Termination  Proofs 

The erasure technique (ET) is mainly intended to facilitate modular innermost termination proofs for TRSs.
Notice that innermost termination implies termination for overlay TRSs [Gra95], and therefore ET can
also facilitate (classical) termination proofs. We also show that ET can be directly applied to (classical)
termination proofs. The essential idea behind ET is simulation as presented in [Xi98]. In general, ET can
be regarded as an application of the notion of termination through transformation to both termination and
innermost termination proofs.
3.1  Elementary  Versions  of  ET

In  this  section,  we  establish  some  elementary  versions  of  the  erasure  technique. 

Definition 5. Given a TRS $\mathcal{R}$, we say that $l \to r \in \mathcal{R}$ has a conservative erasure if $|l| \to |r|$ is a legal
rewrite rule and $t$ is skeleton $\mathcal{R}$-normal for every $t \in \epsilon(r)$, that is, all subterms erased from $r$ are skeleton
$\mathcal{R}$-normal. If all the rules in $\mathcal{R}$ have conservative erasures, then we say $\mathcal{R}$ has a conservative erasure
$|\mathcal{R}|$.

The  next  theorem  is  the  most  elementary  one  among  those  for  ET  which  we  will  formulate  and  prove. 
Nonetheless,  this  theorem  has  largely  captured  the  essential  idea  behind  ET. 

Theorem 1. Assume that $\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2$ has an erasure $\mathcal{R}' = \mathcal{R}'_1 \cup \mathcal{R}'_2$, where
$\mathcal{R}'_i$ are the conservative erasures of $\mathcal{R}_i$ for $i = 1, 2$. Also assume that under some reduction ordering,
every rule in $\mathcal{R}'_1$ can be ordered and every rule in $\mathcal{R}'_2$ can be strictly ordered. Then the innermost
termination of $\mathcal{R}_1$ implies the innermost termination of $\mathcal{R}$. In the case where all erasure rules are not
argument-dropping, the termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$.

Proof Suppose that there exists an infinite innermost $\mathcal{R}$-rewriting sequence as follows:

\[ t_1 \to_{\mathcal{R}} t_2 \to_{\mathcal{R}} \cdots \to_{\mathcal{R}} t_n \to_{\mathcal{R}} \cdots \]

where $t_i \to t_{i+1}/(C_i, l_i \to r_i, \sigma_i)$ for some context $C_i$, rule $l_i \to r_i \in \mathcal{R}$ and substitution
$\sigma_i$. We show that there is an infinite innermost $\mathcal{R}_1$-rewriting sequence.

Obviously, we can require that all proper subterms of $t_1$ be in $\mathcal{R}$-normal form since we are handling
innermost rewriting. This implies that all terms in $\epsilon(t_1)$ are in $\mathcal{R}$-normal form. We now show inductively
that this is true for all $t_i$ ($i = 1, 2, \ldots$) by analyzing the difference between $\epsilon(t_i)$ and $\epsilon(t_{i+1})$. Let
$t \in \epsilon(t_{i+1})$ and we have the following.

- $t$ is in $\epsilon(t_i)$. Then $t$ is in $\mathcal{R}$-normal form by induction hypothesis.

- $t$ is not in $\epsilon(t_i)$. Note $t_i = C_i[l_i\sigma_i]$ and $t_{i+1} = C_i[r_i\sigma_i]$. If $t$ contains $r_i\sigma_i$, then there must
be some $s$ in $\epsilon(t_i)$ such that $s \to t$. This is impossible since all terms in $\epsilon(t_i)$ are in $\mathcal{R}$-normal form.
Otherwise, $t$ is dropped from $r_i\sigma_i$. This means that $t$ either equals $s\sigma_i$ for some $s \in \epsilon(r_i)$ or $t$ is a
subterm of $\sigma_i(x)$ for some $x \in dom(\sigma_i)$. In the latter case, $t$ is obviously in $\mathcal{R}$-normal form since this is
innermost rewriting. In the former case, $t$ is in $\mathcal{R}$-normal form since $s$ is skeleton $\mathcal{R}$-normal (note that
$\mathcal{R}'$ is a conservative erasure of $\mathcal{R}$) and $\sigma_i$ is an $\mathcal{R}$-normal substitution.

Therefore, for $i = 1, 2, \ldots$, all terms in $\epsilon(t_i)$ are in $\mathcal{R}$-normal form. By Lemma 2, we have the following.

\[ |t_1| \to^{0/1}_{|\mathcal{R}|} |t_2| \to^{0/1}_{|\mathcal{R}|} \cdots \to^{0/1}_{|\mathcal{R}|} |t_n| \to^{0/1}_{|\mathcal{R}|} \cdots \]

We now show that every $\to^{0/1}_{|\mathcal{R}|}$ step in this sequence is actually a $\to_{|\mathcal{R}|}$ step. It suffices to show
that $|C_i|$ is always a context for $i = 1, 2, \ldots$. Suppose that $|C_i|$ is not a context. Then $l_i\sigma_i$ is a subterm
of some term in $\epsilon(t_i)$. This is impossible since all terms in $\epsilon(t_i)$ are in $\mathcal{R}$-normal form. This implies that
we actually have the following.

\[ |t_1| \to_{|\mathcal{R}|} |t_2| \to_{|\mathcal{R}|} \cdots \to_{|\mathcal{R}|} |t_n| \to_{|\mathcal{R}|} \cdots \]

Since all rules in $\mathcal{R}'_1$ are ordered and all rules in $\mathcal{R}'_2$ are strictly ordered, there must be an $n$ such that
all the rules applied after $|t_n|$ are in $\mathcal{R}'_1$. This implies that all the rules applied in the infinite innermost
$\mathcal{R}$-rewriting sequence after $t_n$ are in $\mathcal{R}_1$, that is, we have an infinite innermost $\mathcal{R}_1$-rewriting sequence.
Therefore, the innermost termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$.

We now prove the second part of the theorem. Suppose that all the erasure rules are not
argument-dropping. Then $|C|$ is a context for every context $C$. Therefore, $t_1 \to_{\mathcal{R}} t_2$ implies
$|t_1| \to_{|\mathcal{R}|} |t_2|$ for every pair of terms $t_1$ and $t_2$. With the same argument as before, we can show that an
infinite $\mathcal{R}$-rewriting sequence induces an infinite $\mathcal{R}_1$-rewriting sequence. Therefore, the termination of
$\mathcal{R}_1$ implies that of $\mathcal{R}$. ■
Notice that we assume no relation between $\mathcal{R}_1$ and $\mathcal{R}_2$ in Theorem 1. This is an attractive feature in
practice. Suppose that we intend to prove the termination of $\mathcal{R}$. We proceed to find a conservative erasure
$\mathcal{R}'$ of $\mathcal{R}$ such that all rules in $\mathcal{R}'$ can be ordered under some reduction ordering. If there are rules in
$\mathcal{R}'$ which can be strictly ordered, we remove them and use $\mathcal{R}'_1$ for the set of remaining rules. We can then
find $\mathcal{R}_1 \subseteq \mathcal{R}$ such that $\mathcal{R}'_1$ is the conservative erasure of $\mathcal{R}_1$. In this way, we have reduced the
innermost termination of $\mathcal{R}$ to that of $\mathcal{R}_1$. If $\mathcal{R}_1$ is empty, then we have proven that $\mathcal{R}$ is innermost
terminating. Clearly, there is no need for splitting $\mathcal{R}$ before applying Theorem 1.

The following TRS $\mathcal{R}_{wt}$ is taken from [AG98]. Note that $m, n$ are variables, $::$ is the infix operator for
cons, [] for nil and $[n]$ for $cons(n, nil)$. The function weight computes a weighted sum of natural numbers:
$weight(n_0 :: n_1 :: \cdots :: n_k :: nil) = n_0 + \sum_{i=1}^{k} i * n_i$.

(1) $sum(s(m) :: x, n :: y) \to sum(m :: x, s(n) :: y)$

(2) $sum(0 :: x, y) \to sum(x, y)$

(3) $sum([], y) \to y$

(4) $weight([n]) \to n$

(5) $weight(m :: n :: x) \to weight(sum(m :: n :: x, 0 :: x))$

The last rule is self-embedding, and therefore the TRS cannot be proven terminating with a simplification
ordering. Intuitively, $\mathcal{R}_{wt}$ is terminating because the length of $sum(m :: n :: x, 0 :: x)$ is less than that of
$m :: n :: x$. We can use the erasure TRS $\mathcal{S} = \{(sum\text{-}p\text{-}2), (s\text{-}p\text{-}1)\}$ to capture this. The following TRS
$\mathcal{R}'_{wt}$ is the $\mathcal{S}$-erasure of $\mathcal{R}_{wt}$.

(1') $n :: y \to n :: y$

(2') $y \to y$

(3') $y \to y$

(4') $weight([n]) \to n$

(5') $weight(m :: n :: x) \to weight(0 :: x)$

Notice that this is a conservative erasure. For instance, let $r$ be the right-hand side of rule (5); then
$\epsilon(r)_{\mathcal{S}}$ is $\{m :: n :: x\}$, in which the term is skeleton $\mathcal{R}_{wt}$-normal. Clearly, $\mathcal{R}'_{wt}$ can be ordered
under an RPO. Since the rules (4') and (5') are strictly ordered, we delete them. Therefore, the innermost
termination of $\mathcal{R}_{sum}$, which consists of the rules (1), (2) and (3), implies that of $\mathcal{R}_{wt}$ by Theorem 1. The
termination of $\mathcal{R}_{sum}$ is readily proven with an RPOS, and thus $\mathcal{R}_{wt}$ is innermost terminating. In this case
$\mathcal{R}_{wt}$ is terminating since it is an overlay (actually non-overlapping) TRS.
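As a worked check of the erasure of rule (5), the computation can be unfolded using only the recursive
definitions of $|t|$ and $\epsilon(t)$ from Definition 4 with $\mathcal{S} = \{(sum\text{-}p\text{-}2), (s\text{-}p\text{-}1)\}$. This is our own
unfolding, written out for illustration.

```latex
\begin{align*}
|weight(sum(m :: n :: x,\; 0 :: x))|
  &= weight(|sum(m :: n :: x,\; 0 :: x)|) && \text{no rule in } \mathcal{S} \text{ for } weight \\
  &= weight(|0 :: x|)                     && (sum\text{-}p\text{-}2) \in \mathcal{S} \\
  &= weight(0 :: x)                       && \text{no rule in } \mathcal{S} \text{ for } ::,\ 0 \\[4pt]
\epsilon(weight(sum(m :: n :: x,\; 0 :: x)))
  &= \epsilon(sum(m :: n :: x,\; 0 :: x)) \\
  &= \{m :: n :: x\} \cup \epsilon(0 :: x) \\
  &= \{m :: n :: x\}
\end{align*}
```

The only erased term, $m :: n :: x$, is built from the constructor $::$ and variables, so it is skeleton
$\mathcal{R}_{wt}$-normal by Proposition 1.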

On  the  other  hand,  if  we  can  split  a  TRS  into  some  hierarchical  combination,  then  we  can  take 
advantage  of  Theorem  2  below,  which  is  a  generalized  version  of  Theorem  1.  We  first  present  a  definition 
very  close  to  Definition  5. 

Definition 6. Let $\mathcal{R}$ be the hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$. We say that $l \to r \in \mathcal{R}$ has an
$\mathcal{R}_2$-conservative erasure if $|l| \to |r|$ is a legal rewrite rule and $t$ is skeleton $\mathcal{R}_2$-normal for every
$t \in \epsilon(r)$. If all rules in $\mathcal{R}$ have $\mathcal{R}_2$-conservative erasures, then we say $\mathcal{R}$ has an $\mathcal{R}_2$-conservative
erasure $|\mathcal{R}|$.

Theorem 2. Let $\mathcal{S}$ be an erasure TRS and $\mathcal{R}$ be the hierarchical combination of $\mathcal{R}_1$ and
$\mathcal{R}_2 = \mathcal{R}_{21} \cup \mathcal{R}_{22}$ such that $|\mathcal{R}_1|$ and $|\mathcal{R}_2|$ are $\mathcal{R}_2$-conservative. Assume that under some
reduction ordering, the erasure of every rule in $\mathcal{R}_1$ and $\mathcal{R}_{21}$ can be ordered and the erasure of every rule in
$\mathcal{R}_{22}$ can be strictly ordered. Then the innermost termination of $\mathcal{R}_1 \cup \mathcal{R}_{21}$ implies the innermost
termination of $\mathcal{R}$. In the case where all erasure rules in $\mathcal{S}$ are not argument-dropping, the termination of
$\mathcal{R}_1 \cup \mathcal{R}_{21}$ implies the termination of $\mathcal{R}$.

Proof This is very similar to the proof of Theorem 1. Suppose that there exists an infinite innermost
$\mathcal{R}$-rewriting sequence as follows:

\[ t_1 \to_{\mathcal{R}} t_2 \to_{\mathcal{R}} \cdots \to_{\mathcal{R}} t_n \to_{\mathcal{R}} \cdots \]

where $t_i \to t_{i+1}/(C_i, l_i \to r_i, \sigma_i)$ for some context $C_i$, rule $l_i \to r_i \in \mathcal{R}$ and substitution
$\sigma_i$. We show that there is an infinite innermost $(\mathcal{R}_1 \cup \mathcal{R}_{21})$-rewriting sequence.

Obviously, we can require that all proper subterms of $t_1$ be in $\mathcal{R}$-normal form since we are handling
innermost rewriting. This implies that all terms in $\epsilon(t_1)$ are in $\mathcal{R}_2$-normal form. We now show inductively
that this is true for all $t_i$ ($i = 1, 2, \ldots$) by analyzing the difference between $\epsilon(t_i)$ and $\epsilon(t_{i+1})$. Let
$t \in \epsilon(t_{i+1})$ and we have the following.

- $t$ is in $\epsilon(t_i)$. Then $t$ is in $\mathcal{R}_2$-normal form by induction hypothesis.

- $t$ is not in $\epsilon(t_i)$. Note $t_i = C_i[l_i\sigma_i]$ and $t_{i+1} = C_i[r_i\sigma_i]$. If $t$ contains $r_i\sigma_i$, then there must
be some $s$ in $\epsilon(t_i)$ such that $s \to t$. Note $l_i \to r_i$ cannot be in $\mathcal{R}_2$ since $s$ must be in $\mathcal{R}_2$-normal form.
Thus, $t$ is in $\mathcal{R}_2$-normal form by Lemma 1. Otherwise, $t$ is dropped from $r_i\sigma_i$. This means that $t$ either
equals $s\sigma_i$ for some $s \in \epsilon(r_i)$ or $t$ is a subterm of $\sigma_i(x)$ for some $x \in dom(\sigma_i)$. In the latter case, $t$
is obviously in $\mathcal{R}_2$-normal form since this is innermost rewriting. In the former case, $t$ is in
$\mathcal{R}_2$-normal form since $s$ is skeleton $\mathcal{R}_2$-normal (note that $\mathcal{R}'$ is an $\mathcal{R}_2$-conservative erasure of $\mathcal{R}$)
and $\sigma_i$ is an $\mathcal{R}_2$-normal (actually $\mathcal{R}$-normal) substitution.

Therefore, for $i = 1, 2, \ldots$, all terms in $\epsilon(t_i)$ are in $\mathcal{R}_2$-normal form. If $l_i \to r_i \in \mathcal{R}_2$, then $|C_i|$
must be a context since $l_i\sigma_i$ would be a subterm of some $t \in \epsilon(t_i)$ otherwise, which contradicts that all
terms in $\epsilon(t_i)$ are $\mathcal{R}_2$-normal. Thus, if $t_i \to_{\mathcal{R}_2} t_{i+1}$ then $|t_i| \to_{|\mathcal{R}_2|} |t_{i+1}|$. Since all rules
in $|\mathcal{R}_{22}|$ are strictly ordered under some reduction ordering and all rules in $|\mathcal{R}_1| \cup |\mathcal{R}_{21}|$ are ordered,
there must be an $n$ such that for all $i > n$, $l_i \to r_i \notin \mathcal{R}_{22}$. This implies that we have an infinite
innermost $\mathcal{R}$-rewriting sequence in which all applied rules are either from $\mathcal{R}_1$ or $\mathcal{R}_{21}$. Contrapositively,
the innermost termination of $\mathcal{R}_1 \cup \mathcal{R}_{21}$ implies that of $\mathcal{R}$.

The second part of this theorem is really the same as that of Theorem 1. We thus omit the details. ■

We  now  present  an  application  of  Theorem  2.  The  following  example  is  taken  from  the  technical  report 
version  of  [AG97]. 

Let $\mathcal{R}_1$ be a TRS consisting of the following rules,

$le(0, y) \to true$                    $pred(s(x)) \to x$

$le(s(x), 0) \to false$                $minus(x, 0) \to x$

$le(s(x), s(y)) \to le(x, y)$          $minus(x, s(y)) \to pred(minus(x, y))$

and $\mathcal{R}_2$ be a TRS consisting of the following rules.

(1) $gcd(0, y) \to 0$

(2) $gcd(s(x), 0) \to 0$

(3) $gcd(s(x), s(y)) \to ifgcd(le(y, x), s(x), s(y))$

(4) $ifgcd(true, s(x), s(y)) \to gcd(minus(x, y), s(y))$

(5) $ifgcd(false, s(x), s(y)) \to gcd(minus(y, x), s(x))$

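Let $\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2$. $\mathcal{R}$ is clearly a hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$. We form an
$\mathcal{S}$-erasure $\mathcal{R}'$ of $\mathcal{R}$ as follows for $\mathcal{S} = \{(pred\text{-}p\text{-}1), (minus\text{-}p\text{-}1), (ifgcd\text{-}o\text{-}1)\}$.
$\mathcal{R}' = \mathcal{R}'_1 \cup \mathcal{R}'_2$, where $\mathcal{R}'_1$ consists of the following rules

$le(0, y) \to true$                    $s(x) \to x$

$le(s(x), 0) \to false$                $x \to x$

$le(s(x), s(y)) \to le(x, y)$          $x \to x$

and $\mathcal{R}'_2$ consists of the following rules.

(1') $gcd(0, y) \to 0$

(2') $gcd(s(x), 0) \to 0$

(3') $gcd(s(x), s(y)) \to ifgcd(s(x), s(y))$

(4') $ifgcd(s(x), s(y)) \to gcd(x, s(y))$

(5') $ifgcd(s(x), s(y)) \to gcd(y, s(x))$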
It can be readily verified that $\mathcal{R}'$ is an $\mathcal{R}_2$-conservative erasure of $\mathcal{R}$. Under the RPO with the precedence
relation $gcd \approx ifgcd$ and $le \succ true, false$, all the rules in $\mathcal{R}'_1$ and the rule (3') can be ordered, and the
rules (1'), (2'), (4') and (5') can be strictly ordered. By Theorem 2, the innermost termination of
$\mathcal{R}_1 \cup \{(3)\}$ implies that of $\mathcal{R}$. Since $\mathcal{R}_1 \cup \{(3)\}$ can be easily proven terminating with an RPOS, $\mathcal{R}$
is innermost terminating. Note that $\mathcal{R}$ is a non-overlapping TRS and thus $\mathcal{R}$ is terminating.
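To see where the erased rules (3') and (4') come from and why the erasure is $\mathcal{R}_2$-conservative, the
definitions can be unfolded for rules (3) and (4) with $\mathcal{S} = \{(pred\text{-}p\text{-}1), (minus\text{-}p\text{-}1), (ifgcd\text{-}o\text{-}1)\}$.
The following is our own worked computation, given for illustration.

```latex
\begin{align*}
|ifgcd(le(y, x),\, s(x),\, s(y))| &= ifgcd(s(x),\, s(y)),
  & \epsilon(ifgcd(le(y, x),\, s(x),\, s(y))) &= \{le(y, x)\}, \\
|ifgcd(true,\, s(x),\, s(y))| &= ifgcd(s(x),\, s(y)),
  & & \\
|gcd(minus(x, y),\, s(y))| &= gcd(|minus(x, y)|,\, s(y)) = gcd(x,\, s(y)),
  & \epsilon(gcd(minus(x, y),\, s(y))) &= \{y\}.
\end{align*}
```

The terms erased from the right-hand sides are $le(y, x)$ and $y$. Neither contains an $\mathcal{R}_2$-defined symbol
($gcd$ or $ifgcd$), so both are skeleton $\mathcal{R}_2$-normal, as Definition 6 requires.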

In practice, we may encounter the case where $\mathcal{R}_1 = \emptyset$ when we apply Theorem 1, or $\mathcal{R}_{21} = \emptyset$ when we
apply Theorem 2. Let us consider a concrete example. The TRS $\mathcal{R}$ consists of the following single rule.

$f(g(x)) \to g(f(f(x)))$

If we form the $\mathcal{S}$-erasure of $\mathcal{R}$ for $\mathcal{S} = \{(f\text{-}p\text{-}1)\}$, we obtain the following TRS $|\mathcal{R}|$,

$g(x) \to g(x)$

which cannot be strictly ordered under any reduction ordering. Therefore, if we apply Theorem 1, we
make no progress. However, we can argue that $\mathcal{R}$ is terminating as follows. Suppose that there is an
infinite $\mathcal{R}$-rewriting sequence:

\[ t_1 \to_{\mathcal{R}} t_2 \to_{\mathcal{R}} \cdots \to_{\mathcal{R}} t_n \to_{\mathcal{R}} \cdots \]

We can choose $t_1$ such that all proper subterms of $t_1$ are $\mathcal{R}$-terminating and $t$ is $\mathcal{R}$-terminating for every
$t$ if $|t|$ is a proper subterm of $|t_1|$. Then $t_1$ must be of form $f(s)$. Since $s$ is $\mathcal{R}$-terminating, there is some
$t_n = f(g(s'))$ such that $s \to^* g(s')$ and $t_{n+1} = g(f(f(s')))$. It is clear that $|t_1| = |t_{n+1}| = g(|s'|)$. Given
the property of $t_1$, we know that $s'$ is $\mathcal{R}$-terminating. This implies that $t_{n+1}$ is $\mathcal{R}$-terminating,
contradicting that the above $\mathcal{R}$-rewriting sequence is infinite. Therefore, there exists no infinite
$\mathcal{R}$-rewriting sequence, that is, $\mathcal{R}$ is terminating. We present a formalization of this idea as follows.

Definition 7. Let $\mathcal{S}$ be an erasure TRS. Given a simplification ordering $\succeq_1$ on terms and a quasi
precedence relation $\succeq$ on a finite set of function symbols, we can define a (strict) ordering $\succ_2$ as follows.
Given $s$ and $t$, $s \succ_2 t$ if either $|s| \succ_1 |t|$, or $|s| \succeq_1 |t|$ and $s$ and $t$ are of form $f(s_1, \ldots, s_m)$ and
$g(t_1, \ldots, t_n)$, respectively, and $f \succ g$ and

- there is no erasure rule in $\mathcal{S}$ associated with $g$, or

- the erasure rule in $\mathcal{S}$ associated with $g$ is an omitting rule, or

- $(g\text{-}p\text{-}i) \in \mathcal{S}$ and $s \succ_2 t_i$.

Lemma 3. The ordering $\succ_2$ defined in Definition 7 is well-founded and stable under substitutions.

Proof This is straightforward since $\succ_1$ is well-founded and stable under substitutions and $\succ$ is
well-founded.


Theorem 3. Let $\mathcal{S}$ be an erasure TRS and $\mathcal{R}$ be the hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$ such that
$|\mathcal{R}_1|$ and $|\mathcal{R}_2|$ are $\mathcal{R}_2$-conservative, and the erasure of every rule in $\mathcal{R}_1$ and $\mathcal{R}_2$ can be ordered under
some simplification ordering $\succeq_1$. Let $\succeq$ be a quasi precedence relation on a finite set of function symbols, and
we form an ordering $\succ_2$ as described in Definition 7. Assume that for every rule $l \to r \in \mathcal{R}_2$, either $r$ is
skeleton $\mathcal{R}_2$-normal or $l \succ_2 r$. Then the innermost termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$. If all the
erasure rules in $\mathcal{S}$ are not argument-dropping and for every rule $l \to r \in \mathcal{R}_2$, either $r$ is skeleton
$\mathcal{R}_2$-terminating or $l \succ_2 r$, then the termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$.

Proof Assume that $\mathcal{R}_1$ is innermost terminating but $\mathcal{R}$ is not. Let $P(s)$ be a property on terms stating
that $s$ is not $\mathcal{R}$-terminating but all proper subterms of $s$ are $\mathcal{R}$-terminating. Since $\succ_2$ is well-founded by
Lemma 3, we can choose a term $s$ such that $P(s)$ holds but $P(t)$ fails for every $t$ satisfying $s \succ_2 t$. We
can prove by a structural induction the claim that $t$ is $\mathcal{R}$-terminating for every $t$ such that all terms in
$\epsilon(t)$ are in $\mathcal{R}_2$-normal form and $s \succ_2 t$. Please see the proof of Theorem 4 for details. Since $P(s)$ holds,
there exists an infinite innermost $\mathcal{R}$-rewriting sequence of the following form,

\[ s = f(s_1, \ldots, s_m) \to^*_{\mathcal{R}} f(s'_1, \ldots, s'_m) = s' \to_{\mathcal{R}} s'' \to_{\mathcal{R}} \cdots \]

where $s_i \to^*_{\mathcal{R}} s'_i$ and $s'_i$ are in $\mathcal{R}$-normal form for $i = 1, \ldots, m$ and $s' = l\sigma$ and $s'' = r\sigma$ and
$l \to r \in \mathcal{R}$.

$l \to r$ must be a rule in $\mathcal{R}_2$ by Lemma 1 (2) and $r$ clearly cannot be skeleton $\mathcal{R}_2$-normal. Therefore,
$s' \succ_2 s''$. It can be readily proven that $s \succ_2 s''$ since all rules in $\mathcal{R}$ are ordered under $\succeq_1$. Now let us
assume that $s''$ is of form $g(t_1, \ldots, t_n)$. We do a case analysis on the form of $|s''|$.

- There exists no rule in $\mathcal{S}$ associated with $g$. This case is the same as the next one.

- $(g\text{-}o\text{-}(i_1, \ldots, i_k)) \in \mathcal{S}$. Then we have

\[ |s''| = g(|t_1|, \ldots, |t_{i_1 - 1}|, |t_{i_1 + 1}|, \ldots, |t_{i_k - 1}|, |t_{i_k + 1}|, \ldots, |t_n|). \]

Note that all terms in $\epsilon(s'')$ are in $\mathcal{R}_2$-normal form since $|\mathcal{R}|$ is $\mathcal{R}_2$-conservative. We have
$|s| \succeq_1 |s''| \succ_1 |t_j|$ for $j \in \{1, \ldots, n\} \setminus \{i_1, \ldots, i_k\}$ since $\succeq_1$ is a simplification ordering. Hence,
$s \succ_2 t_j$. With the above claim, these $t_j$ are $\mathcal{R}$-terminating since $\epsilon(t_j) \subseteq \epsilon(s'')$. Clearly,
$t_{i_1}, \ldots, t_{i_k}$ are $\mathcal{R}$-terminating and this implies that all proper subterms of $s''$ are $\mathcal{R}$-terminating.
Hence $s''$ is $\mathcal{R}$-terminating since $P(s'')$ fails by $s \succ_2 s''$. We have thus reached a contradiction.

- $(g\text{-}p\text{-}i) \in \mathcal{S}$. Then $|s''| = |t_i|$. We have $s' \succ_2 t_i$ by the definition of $\succ_2$, and this can lead to
$s \succ_2 t_i$. With the above claim, $t_i$ is $\mathcal{R}$-terminating since $\epsilon(t_i) \subseteq \epsilon(s'')$. Clearly, $t_j$ are
$\mathcal{R}$-terminating for all $j \in \{1, \ldots, i - 1, i + 1, \ldots, n\}$. Again, this implies that $s''$ is $\mathcal{R}$-terminating
since $P(s'')$ fails by $s \succ_2 s''$. This is a contradiction.

Therefore, $\mathcal{R}$ must be innermost terminating. It should be straightforward to prove the second part of the
theorem. ■

We present an application of Theorem 3. The following TRS $\mathcal{R}$ is due to Dershowitz.

(1) $\neg(\neg(x)) \to x$

(2) $\neg(x \wedge y) \to \neg(\neg(\neg(x))) \vee \neg(\neg(\neg(y)))$

(3) $\neg(x \vee y) \to \neg(\neg(\neg(x))) \wedge \neg(\neg(\neg(y)))$

The following is the $\mathcal{S}$-erasure $|\mathcal{R}|$ of $\mathcal{R}$ for $\mathcal{S} = \{(\neg\text{-}p\text{-}1)\}$.

$x \to x$

$x \wedge y \to x \vee y$

$x \vee y \to x \wedge y$

Clearly, all rules in $|\mathcal{R}|$ are ordered in the RPO with the precedence $\wedge \approx \vee$. Let $\succeq_1$ denote this RPO. We
can form an ordering $\succ_2$ with the precedence relation $\neg \succ \wedge, \vee$ as described in Definition 7. Notice that
the right side of (1) is skeleton $\mathcal{R}$-terminating and both rules (2) and (3) can be ordered under $\succ_2$. By
Theorem 3, $\mathcal{R}$ is terminating since the rule in $\mathcal{S}$ is not argument-dropping.
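For concreteness, the claim that rule (2) is ordered under $\succ_2$ can be checked against Definition 7 as
follows. This is our own unfolding of the definition, with $s$ the left-hand side and $t$ the right-hand side
of rule (2); rule (3) is symmetric.

```latex
\begin{align*}
|s| &= |\neg(x \wedge y)| = x \wedge y, \\
|t| &= |\neg(\neg(\neg(x))) \vee \neg(\neg(\neg(y)))| = x \vee y, \\
|s| &\succeq_1 |t| && \text{since } \wedge \approx \vee \text{ in the precedence of the RPO}, \\
\mathrm{root}(s) = \neg &\succ \vee = \mathrm{root}(t) && \text{by the precedence } \neg \succ \wedge, \vee .
\end{align*}
```

Since no erasure rule in $\mathcal{S} = \{(\neg\text{-}p\text{-}1)\}$ is associated with $\vee$, the first of the three alternatives in
Definition 7 applies, and therefore $s \succ_2 t$.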

Please see Example 6 for a more sophisticated application of Theorem 3.

3.2  Nondeterministic  Erasure  Rules 

For  those  who  are  familiar  with  the  dependency  pair  approach  (DPA)  [AG97,AG98],  it  should  be  clear  that 
the  erasure  technique  presented  so  far  can  be  regarded  as  a  closely  related  idea  recast  into  the  framework 
of  termination  through  transformation.  However,  the  following  development  significantly  separates  ET 
from  DPA. 

Let  us  now  take  a  look  at  a  limitation  of  the  erasure  technique  developed  so  far  before  proceeding  to 
formulate  more  sophisticated  versions  of  ET.  The  rules  associated  with  if  are  the  following. 

$if(true, x, y) \to x$                 $if(false, x, y) \to y$
For the example $\mathcal{R}_{pg}$, we would like to use the erasure TRS $\mathcal{S} = \{(remove\text{-}p\text{-}2), (if\text{-}p\text{-}3)\}$ so that we can
erase the following rule into $cons(y, ys) \to cons(y, ys)$.

$remove(x, cons(y, ys)) \to if(x = y, remove(x, ys), cons(y, remove(x, ys)))$

Unfortunately, we also obtain the erasure $y \to x$ for the rule $if(true, x, y) \to x$, which is not a legal
rewrite rule. This is a severe limitation in practice since $if$ is widely used in defining TRSs. We extend
the definition of erasure to resolve this problem.

Definition 8. Given a function symbol $f$ with arity $n$ and $1 \le i_1 < \cdots < i_k \le n$, the following
nondeterministic rule is also an erasure rule.

\[ (f\text{-}p\text{-}(i_1, \ldots, i_k)) \qquad f(x_1, \ldots, x_n) \to \{x_{i_1}, \ldots, x_{i_k}\} \]

This means that $f(x_1, \ldots, x_n)$ can rewrite to $x_{i_j}$ for each $1 \le j \le k$. This rule is not argument-dropping
if $\{i_1, \ldots, i_k\} = \{1, \ldots, n\}$.

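With this extension, the erasure $|t|$ of a term $t$ is a multiset of terms, which can be defined as follows.

\[
|t| =
\begin{cases}
\{t\} & \text{if } t \text{ is a variable;} \\
f(|t_1|, \ldots, |t_{i_1 - 1}|, |t_{i_1 + 1}|, \ldots, |t_{i_k - 1}|, |t_{i_k + 1}|, \ldots, |t_n|) & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}o\text{-}(i_1, \ldots, i_k)) \in \mathcal{S}; \\
|t_{i_1}| \cup \cdots \cup |t_{i_k}| & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}p\text{-}(i_1, \ldots, i_k)) \in \mathcal{S}; \\
f(|t_1|, \ldots, |t_n|) & \text{if } t = f(t_1, \ldots, t_n) \text{ and otherwise.}
\end{cases}
\]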

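We use the notation $f(|t_1|, \ldots, |t_n|)$ for the multiset

\[ \{ f(s_1, \ldots, s_n) \mid s_i \in |t_i| \text{ for } 1 \le i \le n \}, \]

that is, the multiset of terms $f(s_1, \ldots, s_n)$, where $s_i$ range over $|t_i|$ for $1 \le i \le n$. We also present the
definition for $\epsilon(t)$, which is the set of terms erased from $t$.

\[
\epsilon(t) =
\begin{cases}
\emptyset & \text{if } t \text{ is a variable;} \\
\{t_{i_1}, \ldots, t_{i_k}\} \cup \bigcup_{j \in \{1, \ldots, n\} \setminus \{i_1, \ldots, i_k\}} \epsilon(t_j) & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}o\text{-}(i_1, \ldots, i_k)) \in \mathcal{S}; \\
\{t_1, \ldots, t_{i_1 - 1}, t_{i_1 + 1}, \ldots, t_{i_k - 1}, t_{i_k + 1}, \ldots, t_n\} \cup \epsilon(t_{i_1}) \cup \cdots \cup \epsilon(t_{i_k}) & \text{if } t = f(t_1, \ldots, t_n) \text{ and } (f\text{-}p\text{-}(i_1, \ldots, i_k)) \in \mathcal{S}; \\
\bigcup_{j \in \{1, \ldots, n\}} \epsilon(t_j) & \text{if } t = f(t_1, \ldots, t_n) \text{ and otherwise.}
\end{cases}
\]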

In addition, the erasure $|C|$ of context $C$ is a multiset, in which every element is either a context or a
term. The erasure $|\sigma|$ of substitution $\sigma$ with a finite domain is defined below.

\[ |\sigma| = \{ \tau \mid dom(\tau) = dom(\sigma) \text{ and } \tau(x) \in |\sigma(x)| \text{ for every } x \in dom(\tau) \} \]
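The multiset-valued erasure of terms defined above can be sketched in a few lines. The fragment below is an
illustration under stated assumptions rather than the authors' implementation: terms use a hypothetical tuple
encoding (a variable is a string, an application is a tuple headed by its symbol), an erasure TRS is a dict
mapping a symbol to ("o", positions) or ("p", positions), multisets are Python lists, and the test $x = y$ is
encoded with a made-up symbol "eq".

```python
from itertools import product

def erase_nd(t, S):
    """|t| as a list (multiset) of terms, allowing nondeterministic projections."""
    if isinstance(t, str):
        return [t]
    f, args = t[0], t[1:]
    kind, positions = S.get(f, (None, ()))
    if kind == "p":  # (f-p-(i1,...,ik)): union of the erasures of the chosen arguments
        out = []
        for i in positions:
            out.extend(erase_nd(args[i - 1], S))
        return out
    # Omitting rule or no rule: keep the remaining arguments and combine all choices.
    kept = [a for j, a in enumerate(args, 1) if not (kind == "o" and j in positions)]
    choices = [erase_nd(a, S) for a in kept]
    return [(f,) + combo for combo in product(*choices)]

S = {"remove": ("p", (2,)), "if": ("p", (2, 3))}
LHS = ("remove", "x", ("cons", "y", "ys"))
RHS = ("if", ("eq", "x", "y"),
       ("remove", "x", "ys"),
       ("cons", "y", ("remove", "x", "ys")))
```

Under these assumptions, erase_nd(LHS, S) is the singleton containing cons(y, ys), while erase_nd(RHS, S)
contains both ys and cons(y, ys), one element per branch of the conditional. The rule $if(true, x, y) \to x$ now
erases to $\{x, y\} \to \{x\}$, which no longer introduces a variable on the right that is absent on the left.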

Definition 9. Let $\succeq$ be an ordering on terms. We extend this ordering to the (nonempty) multisets of
terms as follows: $S \succeq^{max} (\succ^{max})\ T$ if and only if for every $t \in T$ there is an $s \in S$ such that
$s \succeq (\succ)\ t$, where $S$ and $T$ stand for multisets of terms.

Please notice the difference between $\succeq^{max}$ and the multiset extension $\succeq^{mul}$. For instance, we have
$\{c(x)\} \succeq^{max} \{x, c(x)\}$ but $\{c(x)\} \not\succeq^{mul} \{x, c(x)\}$. Also we observe that $\succ^{max}$ is well-founded
on the multisets of terms if $\succ$ is well-founded on terms.
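The max-extension of Definition 9 is a one-line check once an ordering on terms is given. The sketch below
is illustrative only: it assumes the hypothetical tuple encoding of terms used for illustration (a variable
is a string, an application is a tuple headed by its symbol), represents multisets as lists, and uses the
subterm relation as a stand-in for $\succeq$. The subterm relation is chosen purely because it is easy to
compute; it is not a reduction ordering.

```python
def subterm_geq(s, t):
    """Stand-in quasi-ordering: s >= t iff t occurs in s (possibly s == t)."""
    if s == t:
        return True
    return not isinstance(s, str) and any(subterm_geq(a, t) for a in s[1:])

def subterm_gt(s, t):
    """Strict part of the stand-in ordering: t is a proper subterm of s."""
    return s != t and subterm_geq(s, t)

def max_ext(order, S, T):
    """S order^max T: every element of T is dominated by some element of S."""
    return all(any(order(s, t) for s in S) for t in T)

CX = ("c", "x")
```

With this stand-in, max_ext(subterm_geq, [CX], ["x", CX]) holds, matching the example
$\{c(x)\} \succeq^{max} \{x, c(x)\}$, whereas the strict version fails because $c(x)$ is not strictly above itself.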

Given a rule $l \to r$, the erasure of this rule is $|l| \to |r|$. The erasure of a TRS is defined similarly.
Note that we no longer consider the erasure of a rule (TRS) as a rule (TRS), but refer to it as a rule (TRS)
erasure. Given a reduction ordering $\succeq$ on terms, we say that the rule erasure $|l| \to |r|$ is strictly ordered
under $\succeq$ if $|l| \succ^{max} |r|$, and it is ordered if $|l| \succeq^{max} |r|$.
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For  instance,  for  5  =  {(remove- p-2),  (if- p-(2,3))},  the  5-erasure  of  7Zpg  is  the  following.  We  write  a 
term  for  the  singleton  set  consisting  of  the  term  to  support  transparent  syntax. 

(1') nil -> nil
(2') cons(y, ys) -> {ys, cons(y, ys)}
(3') purge(nil) -> nil
(4') purge(cons(x, xs)) -> cons(x, purge(xs))

Under the RPO with the precedence purge $\succ$ cons, the top two rule erasures are ordered and the rest are strictly ordered.
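The rule erasures above can be reproduced with a small program. The following Python sketch is a reading of the erasure function inferred from the examples in this paper, not the authors' implementation: a projection rule (f-p-(i1,...,ik)) replaces f(t1,...,tn) by the erasures of the selected arguments, an omission rule (f-o-(i1,...,ik)) drops the selected arguments, and a symbol without a rule keeps all of its arguments. Sets are used in place of multisets for brevity, and the term encoding (strings for variables, tuples for applications) is an assumption.

```python
from itertools import product

def erase(t, rules):
    """Return the set of terms in the erasure |t|.

    rules maps a function symbol to ("p", positions) or ("o", positions),
    with 1-based argument positions.
    """
    if not isinstance(t, tuple):           # a variable erases to itself
        return {t}
    f, args = t[0], t[1:]
    kind, positions = rules.get(f, ("o", ()))
    if kind == "p":                        # projection: union over chosen arguments
        result = set()
        for i in positions:
            result |= erase(args[i - 1], rules)
        return result
    # omission (or no rule): keep the remaining arguments, erased recursively
    kept = [erase(a, rules) for i, a in enumerate(args, start=1)
            if i not in positions]
    return {(f, *combo) for combo in product(*kept)}

S_PG = {"remove": ("p", (2,)), "if": ("p", (2, 3))}

# remove(x, cons(y, ys)) -> if(x = y, remove(x, ys), cons(y, remove(x, ys)))
lhs2 = ("remove", "x", ("cons", "y", "ys"))
rhs2 = ("if", ("=", "x", "y"),
        ("remove", "x", "ys"),
        ("cons", "y", ("remove", "x", "ys")))

# purge(cons(x, xs)) -> cons(x, purge(remove(x, xs)))
lhs4 = ("purge", ("cons", "x", "xs"))
rhs4 = ("cons", "x", ("purge", ("remove", "x", "xs")))

print(erase(lhs2, S_PG), "->", erase(rhs2, S_PG))   # rule erasure (2')
print(erase(lhs4, S_PG), "->", erase(rhs4, S_PG))   # rule erasure (4')
```

The nondeterminism of the projection rule for if shows up as a two-element set on the right-hand side of (2').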

Definition 10. An ordering $\succeq$ is a weak reduction ordering if its strict part $\succ$ is well-founded and stable under substitutions and $\succeq$ is compatible wrt. term structure and stable under substitutions. Notice that a weak reduction ordering $\succeq$ may not be a reduction ordering since it is not required that $\succ$ be also compatible wrt. term structure.

Lemma 4. Let $\succeq$ be a weak reduction ordering which is total on ground terms. Given a ground substitution $\sigma$, that is, $\sigma(x)$ is a ground term for every $x \in dom(\sigma)$, we have the following for every erasure TRS $\mathcal{S}$.

1. There exists a substitution $\sigma_{max} \in |\sigma|_{\mathcal{S}}$ such that for every $\tau \in |\sigma|_{\mathcal{S}}$, $\sigma_{max}(x) \succeq \tau(x)$ holds for all $x \in dom(\sigma)$.

2. If $t$ is a term such that $Var(t) \subseteq dom(\sigma)$, then for every $s_2 \in |t\sigma|_{\mathcal{S}}$ there exists $s_1 \in |t|_{\mathcal{S}}$ such that $s_1 \sigma_{max} \succeq s_2$.

Proof For every $x \in dom(\sigma)$, we can choose a term $t_x \in |\sigma(x)|$ such that $t_x \succeq t$ for all $t \in |\sigma(x)|$ since $\succeq$ is total on ground terms. Let $\sigma_{max}$ be the substitution with domain $dom(\sigma)$ and $\sigma_{max}(x) = t_x$ for all $x \in dom(\sigma)$. By the definition of $|\sigma|$, we obtain (1). (2) can be readily proven by a structural induction on $t$. ■

Notice  that  we  actually  only  require  that  the  weak  reduction  ordering  be  extendable  to  a  total  ordering 
on  ground  terms.  For  example,  reduction  orderings  based  on  RPOS  or  polynomial  interpretations  satisfy 
the  requirement. 

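Definition 11. Let $\mathcal{S}$ be an erasure TRS. For every weak reduction ordering $\succeq$ which is total on ground terms, we can define an ordering $\succeq^{max}_{\mathcal{S}}$ as follows.

$t_1 \succeq^{max}_{\mathcal{S}} t_2$ if and only if $|t_1|_{\mathcal{S}} \succeq^{max} |t_2|_{\mathcal{S}}$

The next proposition states a crucial property of $\succeq^{max}_{\mathcal{S}}$.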

Proposition 3. Given a weak reduction ordering $\succeq$ and an erasure TRS $\mathcal{S}$, the ordering $\succeq^{max}_{\mathcal{S}}$ on terms is also a weak reduction ordering.

Proof By Lemma 4, it is straightforward to prove that both $\succ^{max}_{\mathcal{S}}$ and $\succeq^{max}_{\mathcal{S}}$ are stable under substitutions. The compatibility of $\succeq^{max}_{\mathcal{S}}$ with term structure follows from the definition of the erasure function $|\cdot|_{\mathcal{S}}$. ■

In general, it does not hold that $t_1 \succ^{max}_{\mathcal{S}} t_2$ implies $C[t_1] \succ^{max}_{\mathcal{S}} C[t_2]$ for every context $C$ even if $\succ$ is a reduction ordering. Therefore, we cannot infer that $\succ^{max}_{\mathcal{S}}$ is a reduction ordering under the assumption that $\succ$ is.

Theorem 4. Let the TRS $\mathcal{R}$ be the hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$, and let $|\mathcal{R}_1|_{\mathcal{S}}$ and $|\mathcal{R}_2|_{\mathcal{S}}$ be $\mathcal{R}_2$-conservative TRS erasures for some erasure TRS $\mathcal{S}$. Assume that $\succeq$ is a weak reduction ordering which is total on ground terms and $l \succeq^{max}_{\mathcal{S}} r$ for every rule $l \to r \in \mathcal{R}_1$ and $l \succ^{max}_{\mathcal{S}} r$ for every rule $l \to r \in \mathcal{R}_2$. Then the innermost termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$.

Proof Assume that $\mathcal{R}_1$ is innermost terminating but $\mathcal{R}$ is not. Let $P(t)$ be a property on terms stating that $t$ is not innermost $\mathcal{R}$-terminating and every proper subterm of $t$ is innermost terminating. We can choose a ground term $t$ such that $P(t)$ holds and $P(s)$ fails for every term $s$ satisfying $t \succ^{max}_{\mathcal{S}} s$ since $\succ^{max}_{\mathcal{S}}$ is well-founded. We now prove by a structural induction on $s$ that $s$ is innermost terminating if $t \succ^{max}_{\mathcal{S}} s$ and all terms in $e(s)$ are innermost terminating. Assume that $s$ is of form $f(s_1, \ldots, s_n)$. We do a case analysis on the form of $|s|$.

- No erasure rule in $\mathcal{S}$ is associated with $f$. This case is the same as the next one.

- The erasure rule $(f\text{-o-}(i_1, \ldots, i_k))$ is in $\mathcal{S}$. Then

$|s| = f(|s_1|, \ldots, |s_{i_1 - 1}|, |s_{i_1 + 1}|, \ldots, |s_{i_k - 1}|, |s_{i_k + 1}|, \ldots, |s_n|)$.

$\succeq$ must be a simplification ordering on ground terms since it is total on them. Therefore, for every $j \in \{1, \ldots, n\} \setminus \{i_1, \ldots, i_k\}$, $t \succ^{max}_{\mathcal{S}} s \succ^{max}_{\mathcal{S}} s_j$, and thus $s_j$ is innermost terminating since $e(s_j) \subseteq e(s)$ implies that all terms in $e(s_j)$ are innermost terminating. Also, $s_{i_j}$ is innermost terminating for $1 \le j \le k$ since these terms are in $e(s)$. Therefore, all proper subterms of $s$ are innermost terminating. Given the property of $t$, $s$ is innermost $\mathcal{R}$-terminating.

- The erasure rule $(f\text{-p-}(i_1, \ldots, i_k))$ is in $\mathcal{S}$. This is similar to the previous case.

Thus we have proven the claim that $s$ is innermost $\mathcal{R}$-terminating if $t \succ^{max}_{\mathcal{S}} s$ and all terms in $e(s)$ are innermost terminating.

Assume that $t$ is of form $f(t_1, \ldots, t_n)$. Since $t$ is not innermost $\mathcal{R}$-terminating and all proper subterms of $t$ are innermost $\mathcal{R}$-terminating, there is an infinite innermost rewriting sequence beginning with the following form,

$t = f(t_1, \ldots, t_n) \to^{*}_{i} f(t'_1, \ldots, t'_n) = t' \to_{i} t''$

where $t_i \to^{*}_{i} t'_i$ and $t'_i$ are in $\mathcal{R}$-normal form for $1 \le i \le n$, and $t' = l\sigma$ and $t'' = r\sigma$ for some $l \to r \in \mathcal{R}$. Note that $f$ cannot be a defined function symbol in $\mathcal{R}_1$ by Lemma 1 (2). Hence, $l \to r \in \mathcal{R}_2$ and this implies $l \succ^{max}_{\mathcal{S}} r$. Therefore, we have $t \succeq^{max}_{\mathcal{S}} t' \succ^{max}_{\mathcal{S}} t''$. Note that all terms in $e(t'')$ are $\mathcal{R}_2$-normal since all terms in $e(r)$ are skeleton $\mathcal{R}_2$-normal. Therefore, all terms in $e(t'')$ are innermost $\mathcal{R}$-terminating by Lemma 1 (2). This implies that $t''$ is innermost $\mathcal{R}$-terminating by the above proven claim, contradicting the assumption that $t$ is not innermost $\mathcal{R}$-terminating. Therefore $\mathcal{R}$ is innermost terminating. ■

For instance, $\mathcal{R}_{pg}$ can be readily proven innermost terminating with Theorem 4. We will present in Appendix A more realistic examples which can be proven (innermost) terminating with applications of Theorems 1, 2 and 4. We regard these theorems as the major contribution of this paper.

We  now  present  a  theorem  to  demonstrate  that  the  erasure  technique  can  also  be  directly  applied  to 
termination  proofs.  We  first  establish  a  lemma  needed  later. 

Lemma 5. Let $\succ$ be an ordering on terms. For multisets $S$, $T_1$ and $T_2$ of terms, if $T_1 \succ^{max} T_2$ then $S \cup T_1 \succ^{\circledast} S \cup T_2$.

Proof The lemma immediately follows from the definition of $\succ^{max}$ and $\succ^{\circledast}$. ■

Theorem 5. Let $\mathcal{S}$ be an erasure TRS in which all the rules are not argument-dropping and $\succ$ be a reduction ordering which is total on ground terms. Assume $\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2$ such that

- $|l| = |r|$ for all rules $l \to r \in \mathcal{R}_1$, and

- $|l| \setminus |r| \succ^{max} |r| \setminus |l|$ for all rules $l \to r \in \mathcal{R}_2$.

Then the termination of $\mathcal{R}_1$ implies that of $\mathcal{R}$.

Proof (sketch) Assume that $t_1 \to t_2 / (C, l \to r, \sigma)$. It suffices to prove that $|t_1| \succeq^{\circledast} |t_2|$ if $l \to r \in \mathcal{R}_1$ and $|t_1| \succ^{\circledast} |t_2|$ if $l \to r \in \mathcal{R}_2$.

If $l \to r \in \mathcal{R}_1$, then $|l| = |r|$, which implies that $|t_1| = |t_2|$. We now assume that $l \to r \in \mathcal{R}_2$. Let $|C| = \{C_1, \ldots, C_k\}$. Since all rules in $\mathcal{S}$ are not argument-dropping, every $C_i$ is a context for $1 \le i \le k$. Let us define multisets $S$, $T_1$, $T_2$ of terms as follows.

$S = \bigcup_{1 \le i \le k} \{C_i[t\tau] \mid \tau \in |\sigma| \text{ and } t \in |l| \cap |r|\}$

$T_1 = \bigcup_{1 \le i \le k} \{C_i[t\tau] \mid \tau \in |\sigma| \text{ and } t \in |l| \setminus |r|\}$

$T_2 = \bigcup_{1 \le i \le k} \{C_i[t\tau] \mid \tau \in |\sigma| \text{ and } t \in |r| \setminus |l|\}$

It can be readily proven with Lemma 4 that $T_1 \succ^{max} T_2$ holds. Also we can show $|t_1| = S \cup T_1$ and $|t_2| = S \cup T_2$. By Lemma 5, we have $|t_1| \succ^{\circledast} |t_2|$. ■

Theorem 5 immediately strengthens Theorem 12 (1) in [Zan94], where it is required that $f$ does not occur in $l$ for every $l \to r \in \mathcal{R}$ if the erasure rule $(f\text{-p-}(1, \ldots, n))$ is included in $\mathcal{S}$.³ Applications of Theorem 5 can be found in Appendix A.

4  Related  Work 

There is a large number of results in the literature concerning termination proofs for various modular combinations of TRSs. We refer the reader to [Der95] for a clean explanation of many significant results in this area. The general scenario is to prove the termination of $\mathcal{R}_1 \cup \mathcal{R}_2$ for terminating TRSs $\mathcal{R}_1$ and $\mathcal{R}_2$ under some assumption on the relation between $\mathcal{R}_1$ and $\mathcal{R}_2$. We have found that most of the results such as the ones mentioned in [Der95], though interesting, make assumptions about $\mathcal{R}_1$ and $\mathcal{R}_2$ which are sometimes too strong for the purpose of verifying the termination of the hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$.

We are most interested in the case of the hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$ where the defined function symbols in $\mathcal{R}_1$ are used in $\mathcal{R}_2$ in an essential way, since this closely resembles the structure of a functional or logic program. This almost forces us to know the semantics of $\mathcal{R}_1$ to a certain extent in order to prove the termination of the combined system. ET is proposed to address the issue in a (very) restricted manner. For instance, the use of the projection rule (remove-p-2) in the $\mathcal{R}_{pg}$ example is simply to test that remove(x, ys) can never return a list of length greater than that of ys. This test succeeds because the generated erasure of $\mathcal{R}_{pg}$ can be ordered. Let $\mathcal{R}'_{pg}$ be $\mathcal{R}_{pg}$ in which the rule remove(x, nil) -> nil is replaced with another rule remove(x, nil) -> cons(x, nil); then the test will fail on $\mathcal{R}'_{pg}$ since we cannot order nil -> cons(x, nil). Notice that $\mathcal{R}'_{pg}$ is not terminating. This immediately implies that none of the results mentioned in [Der95] can give modular termination proofs for $\mathcal{R}_{pg}$. If they could, they would also prove this for $\mathcal{R}'_{pg}$ since $\mathcal{R}_{pg}$ and $\mathcal{R}'_{pg}$ exhibit the very same characteristics to them.

The dependency pair approach (DPA) [AG97,AG98], which inspired our work on erasure, deserves special mention. We regard ET as a similar idea cast into the general framework of termination through transformation. The technical explanation is that, to a large extent, erasure amounts to the use of weak reduction orderings, which are referred to as weakly monotonic orderings stable under substitutions in papers on DPA. In general, DPA seems more powerful than ET but it is also (in our opinion) more involved. For instance, DPA uses unification to detect cycles of dependency pairs and the set of usable rules, but this is currently unavailable in ET. However, this seems to be a less significant issue so far in our experiments, especially after we combine ET with the freezing technique [Xi98]. We also plan to incorporate similar ideas into ET if the need arises. We feel that the most significant advantage of ET over DPA is the availability of nondeterministic erasure rules. Because of the lack of a similar feature,

3 A strengthened version of this theorem is proven in [MOZ96] which does allow occurrences of $f$ on the left-hand sides of the rules, but it is nonetheless essentially different from Theorem 5. Please see Example 4 in Appendix A.
DPA is often awkward in handling the conditional if. For instance, we must order the following rule

remove(x, cons(y, ys)) -> if(x = y, remove(x, ys), cons(y, remove(x, ys)))

with a weakly monotone ordering if $\mathcal{R}_{pg}$ is to be proven terminating using DPA. Suppose that we use RPOS as the underlying approach to ordering the rule. We cannot assume remove $\succ$ cons in the precedence relation since this prevents us from strictly ordering the following generated dependency pair: PURGE(cons(x, xs)) > PURGE(remove(x, xs)). If we map if(b, x, y) to x (y), then the rule if(false, x, y) -> y (if(true, x, y) -> x) cannot be ordered. As a consequence, the if function often needs to be “preprocessed” away when DPA is applied because it is difficult to synthesize a weakly monotone ordering based on RPOS or polynomial interpretations in the presence of if to order the generated dependency pairs. For instance, the following rules are introduced in the technical report version of [AG97] for handling the purge function example.

remove(x, cons(y, ys)) -> ifremove(x = y, x, cons(y, ys))
ifremove(true, x, cons(y, ys)) -> remove(x, ys)
ifremove(false, x, cons(y, ys)) -> cons(y, remove(x, ys))

Though the argument is that these rules are introduced to forbid rewriting terms under if-branches until the condition is resolved, we feel that this is at least a bit unnatural since realistic TRSs are seldom formed in such a manner. Notice that the termination of $\mathcal{R}_{pg}$ is independent of whether we rewrite terms under if-branches or not. It seems straightforward to make use of the weak reduction ordering $\succeq^{max}_{\mathcal{S}}$ in DPA for handling if, and this can elegantly resolve the above issue. We will use some concrete examples to further compare ET, or ET plus the freezing technique, with DPA in Appendix A.

The use of projection erasure rules bears some resemblance to distribution elimination [Zan94], but there are also many significant differences. Although it is clearly possible, there seems to be no attempt in [Zan94] to construct the ordering $\succeq^{max}_{\mathcal{S}}$ from a given weak reduction ordering $\succeq$, which we regard as a significant contribution of this paper. Also we mention that the use of an omitting rule $(f\text{-o-}(1, \ldots, n))$ in case $Ar(f) = n$ casually relates to dummy elimination [Fer96].

5  Conclusion 

We have presented a technique named erasure to facilitate termination and innermost termination proofs; this technique is inspired by the dependency pair approach in the literature. The erasure technique (ET) is simple to apply and effective in practice, and it is therefore reasonable to expect that ET can be combined with other automated approaches to termination proofs for TRSs such as freezing [Xi98]. However, we observe in practice that it is difficult to scale even an approach as simple as RPOS. This makes us believe that a more promising direction is to apply ET interactively. In this respect, we have tried ET extensively on various TRSs and the results are encouraging. We present some examples in Appendix A to substantiate this claim.

In general, we are highly motivated to look for approaches to termination proofs for TRSs which are simple and effective. We intend to integrate these approaches into an interactive termination prover for TRSs. The user may be required to interact when applying these approaches but the needed interaction should not be overwhelming. We view this as a promising direction to pursue so as to address the following dilemma: too much automation can severely hinder the scalability of a termination proof procedure for TRSs while too little can easily lead to an amount of required interaction which is simply overwhelming for the user. This should be especially clear to those who have used interactive theorem provers such as PVS [ORR+96] or Isabelle [Law94] for proving the termination of recursively defined functions.
A Examples

Example 1. We often combine ET with the freezing technique [Xi98] in practice. Let $\mathcal{R}_{fact}$ be the following TRS [KL80].

p(s(0)) -> 0                  fact(0) -> s(0)
p(s(s(x))) -> s(p(s(x)))      fact(s(x)) -> s(x) * fact(p(s(x)))

The following $\mathcal{R}'_{fact}$ is a (p, s, ps, 1)-frozen version of $\mathcal{R}_{fact}$, and therefore the termination of $\mathcal{R}'_{fact}$ implies that of $\mathcal{R}_{fact}$.

(1) p(s(0)) -> 0              (4) ps(s(x)) -> s(ps(x))
(2) ps(0) -> 0                (5) fact(0) -> s(0)
(3) p(s(s(x))) -> s(ps(x))    (6) fact(s(x)) -> s(x) * fact(ps(x))

The following $|\mathcal{R}'_{fact}|$ is the $\mathcal{S}$-erasure of $\mathcal{R}'_{fact}$ for $\mathcal{S}$ = {(ps-p-1)}.

(1') p(s(0)) -> 0             (4') s(x) -> s(x)
(2') 0 -> 0                   (5') fact(0) -> s(0)
(3') p(s(s(x))) -> s(x)       (6') fact(s(x)) -> s(x) * fact(x)

Under the RPO with the precedence fact $\succ$ *, rules (2') and (4') can be ordered and the rest of the rules can be strictly ordered. Since the TRS consisting of rules (2) and (4) is obviously terminating, the termination of $\mathcal{R}'_{fact}$ follows from Theorem 1. Therefore, $\mathcal{R}_{fact}$ is terminating by a theorem on the freezing technique.
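To make the relation between the original and the frozen system concrete, the following Python sketch normalizes a ground term under both rule sets with a naive innermost strategy. It is only an illustration under stated assumptions: the term encoding (strings for variables, tuples for applications, constants as 1-tuples) and the helper names match, substitute and normalize are hypothetical, and the rewriter simply applies the first matching rule.

```python
def is_var(t):
    return isinstance(t, str)

def match(pattern, term, subst):
    """Extend subst so that pattern under subst equals term, or return None."""
    if is_var(pattern):
        if pattern in subst:
            return subst if subst[pattern] == term else None
        return {**subst, pattern: term}
    if is_var(term) or pattern[0] != term[0] or len(pattern) != len(term):
        return None
    for p, t in zip(pattern[1:], term[1:]):
        subst = match(p, t, subst)
        if subst is None:
            return None
    return subst

def substitute(t, subst):
    if is_var(t):
        return subst[t]
    return (t[0], *(substitute(a, subst) for a in t[1:]))

def normalize(term, rules):
    """Innermost normalization: normalize the arguments, then the root."""
    if is_var(term):
        return term
    term = (term[0], *(normalize(a, rules) for a in term[1:]))
    for lhs, rhs in rules:
        subst = match(lhs, term, {})
        if subst is not None:
            return normalize(substitute(rhs, subst), rules)
    return term

ZERO = ("0",)
def s(t): return ("s", t)

R_FACT = [
    (("p", s(ZERO)), ZERO),
    (("p", s(s("x"))), s(("p", s("x")))),
    (("fact", ZERO), s(ZERO)),
    (("fact", s("x")), ("*", s("x"), ("fact", ("p", s("x"))))),
]

R_FACT_FROZEN = [
    (("p", s(ZERO)), ZERO),
    (("ps", ZERO), ZERO),
    (("p", s(s("x"))), s(("ps", "x"))),
    (("ps", s("x")), s(("ps", "x"))),
    (("fact", ZERO), s(ZERO)),
    (("fact", s("x")), ("*", s("x"), ("fact", ("ps", "x")))),
]

two = s(s(ZERO))
print(normalize(("fact", two), R_FACT))
print(normalize(("fact", two), R_FACT_FROZEN))
```

Since * has no rules in either system, the normal form of fact(s(s(0))) is the unevaluated product s(s(0)) * (s(0) * s(0)) in both cases.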

If we apply DPA to $\mathcal{R}_{fact}$, the following dependency pair is generated.

FACT(s(x)) > FACT(p(s(x)))

It is unclear how this can be strictly ordered since we cannot project away the argument of p because of the existence of the rule p(s(s(x))) -> s(p(s(x))). If one argues that this example is too contrived, then the following example exhibits the same characteristics.

Example 2. In the following TRS $\mathcal{R}_{log}$, $h(n) = \lfloor n/2 \rfloor$ for every natural number $n$, and $log(n) = 1 + \lfloor \log_2(n) \rfloor$ for $n > 0$.

h(0) -> 0                 log(0) -> 0
h(s(0)) -> 0              log(s(x)) -> s(log(h(s(x))))
h(s(s(x))) -> s(h(x))

The last log rule is self-embedding, and therefore the termination of this TRS cannot be proven with a simplification ordering. We form an (h, s, 1, hs)-frozen version $\mathcal{R}'_{log}$ of $\mathcal{R}_{log}$ as follows.

(1) h(0) -> 0                 (5) hs(s(x)) -> s(h(x))
(2) h(s(0)) -> 0              (6) log(0) -> 0
(3) hs(0) -> 0                (7) log(s(x)) -> s(log(hs(x)))
(4) h(s(s(x))) -> s(h(x))

We can prove the termination of $\mathcal{R}'_{log}$ by forming its $\mathcal{S}$-erasure for $\mathcal{S}$ = {(h-p-1), (hs-p-1)}. This then implies the termination of $\mathcal{R}_{log}$. We omit the details, which are straightforward to fill in. Notice that this example cannot be handled by DPA for the same reason as explained in the previous example.
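The arithmetic meaning of the rules in this example can be checked directly. The short Python sketch below mirrors the h and log rules on ordinary natural numbers (s(x) becomes x + 1); it is an illustration only, and the use of `int.bit_length`, which equals 1 plus the integer part of the base-2 logarithm for positive integers, is our choice of reference function.

```python
def h(n):
    if n == 0:
        return 0              # h(0) -> 0
    if n == 1:
        return 0              # h(s(0)) -> 0
    return 1 + h(n - 2)       # h(s(s(x))) -> s(h(x))

def log(n):
    if n == 0:
        return 0              # log(0) -> 0
    return 1 + log(h(n))      # log(s(x)) -> s(log(h(s(x))))

for n in range(1, 9):
    print(n, h(n), log(n))
```

The recursive call log(h(n)) terminates because h(n) is strictly smaller than n for every positive n, which is the fact that the projection rules for h and hs let the erasure expose.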

In general, we intend to apply various transformations for proving the (innermost) termination of a TRS $\mathcal{R}$. We generate a chain of TRSs $\mathcal{R} = \mathcal{R}_1, \mathcal{R}_2, \ldots, \mathcal{R}_n$ such that the (innermost) termination of $\mathcal{R}_{i+1}$ implies that of $\mathcal{R}_i$ for $1 \le i < n$ and the (innermost) termination of $\mathcal{R}_n$ can be proven with some basic approach such as RPOS or polynomial interpretations. The problem with DPA is that it generates a set of dependency pairs rather than a TRS, and therefore it is difficult to combine with other transformational approaches.
Example 3. The following TRS $\mathcal{R}_{qs}$ defines a quicksort function on lists. Let $\mathcal{R}_1$ consist of all these rules except the last 2, and $\mathcal{R}_2$ consist of the last 2 rules. Then $\mathcal{R}_{qs}$ is the hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$.

(1) if(true, x, y) -> x
(2) if(false, x, y) -> y
(3) 0 ≤ x -> true
(4) s(x) ≤ 0 -> false
(5) s(x) ≤ s(y) -> x ≤ y
(6) gte(x, []) -> []
(7) gte(x, y :: ys) -> if(x ≤ y, y :: gte(x, ys), gte(x, ys))
(8) lt(x, []) -> []
(9) lt(x, y :: ys) -> if(x ≤ y, lt(x, ys), y :: lt(x, ys))
(10) [] @ ys -> ys
(11) (x :: xs) @ ys -> x :: (xs @ ys)
(12) quicksort([]) -> []
(13) quicksort(x :: xs) -> quicksort(lt(x, xs)) @ [x] @ quicksort(gte(x, xs))

The following is the $\mathcal{S}$-erasure of $\mathcal{R}_{qs}$ for $\mathcal{S}$ = {(if-p-(2,3)), (gte-p-2), (lt-p-2)}.


(1') {x, y} -> x
(2') {x, y} -> y
(3') 0 ≤ x -> true
(4') s(x) ≤ 0 -> false
(5') s(x) ≤ s(y) -> x ≤ y
(6') [] -> []
(7') y :: ys -> {y :: ys, ys}
(8') [] -> []
(9') y :: ys -> {ys, y :: ys}
(10') [] @ ys -> ys
(11') (x :: xs) @ ys -> x :: (xs @ ys)
(12') quicksort([]) -> []
(13') quicksort(x :: xs) -> quicksort(xs) @ [x] @ quicksort(xs)


Under the RPO with precedence quicksort $\succ$ ≤ $\succ$ true, false, all rule erasures in $\mathcal{R}_1$ can be ordered and all rule erasures in $\mathcal{R}_2$ can be strictly ordered. By Theorem 4, the innermost termination of $\mathcal{R}_{qs}$ follows from that of $\mathcal{R}_1$. It can be readily proven with an RPO that $\mathcal{R}_1$ is (innermost) terminating, and therefore $\mathcal{R}_{qs}$ is innermost terminating. This implies that $\mathcal{R}_{qs}$ is terminating since it is non-overlapping.
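For intuition, the quicksort rules can be transcribed into an ordinary functional program. The Python sketch below is such a transcription under the assumption that the comparison in the rules is "less than or equal" on natural numbers; Python lists stand in for the list constructors, and the function names follow the TRS.

```python
def gte(x, ys):
    """Elements y of ys with x <= y, mirroring the two gte rules."""
    if not ys:
        return []
    y, rest = ys[0], ys[1:]
    return [y] + gte(x, rest) if x <= y else gte(x, rest)

def lt(x, ys):
    """Elements y of ys with y < x, mirroring the two lt rules."""
    if not ys:
        return []
    y, rest = ys[0], ys[1:]
    return lt(x, rest) if x <= y else [y] + lt(x, rest)

def quicksort(xs):
    if not xs:
        return []
    x, rest = xs[0], xs[1:]
    return quicksort(lt(x, rest)) + [x] + quicksort(gte(x, rest))

sample = [3, 1, 2, 3, 0]
print(quicksort(sample))
# Neither lt nor gte can return a list longer than its second argument.
print(all(len(lt(x, sample)) <= len(sample) and
          len(gte(x, sample)) <= len(sample) for x in range(5)))
```

The second check is the semantic fact that the projection rules (gte-p-2) and (lt-p-2) approximate: the recursive calls of quicksort receive lists no longer than xs, which is shorter than x :: xs.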

A  similar  example  also  appears  in  the  technical  report  version  of  [AG97],  but  if  is  “preprocessed” 
away.  The  termination  of  that  example  can  be  readily  proven  with  Theorem  2. 


Example 4. Let $\mathcal{R}_1$ be the following TRS.

(1) f(0) -> 0
(2) f(branch(0, x)) -> branch(0, f(x))
(3) f(branch(branch(x, y), z)) -> f(branch(x, branch(y, z)))
(4) g(0) -> 0
(5) g(branch(x, 0)) -> branch(0, g(x))
(6) g(branch(x, branch(y, z))) -> g(branch(branch(x, y), z))

The following is the $\mathcal{S}$-erasure of $\mathcal{R}_1$ for $\mathcal{S}$ = {(branch-p-(1,2))}.


(1') f(0) -> 0
(2') {f(0), f(x)} -> {0, f(x)}
(3') {f(x), f(y), f(z)} -> {f(x), f(y), f(z)}
(4') g(0) -> 0
(5') {g(x), g(0)} -> {0, g(x)}
(6') {g(x), g(y), g(z)} -> {g(x), g(y), g(z)}

By Theorem 5, the termination of $\mathcal{R}_1$ follows from the termination of $\mathcal{R}_2$ = {(3), (6)}. We now construct a TRS $\mathcal{R}_3$ below, which is an (f, branch, 1, fbranch)-frozen version of $\mathcal{R}_2$.

fbranch(branch(x, y), z) -> fbranch(x, branch(y, z))
g(branch(x, branch(y, z))) -> g(branch(branch(x, y), z))

The termination of $\mathcal{R}_3$ is easily proven with an RPOS, and therefore $\mathcal{R}_1$ is terminating. We point out that it would be greatly involved (though possible) if we applied the freezing technique to $\mathcal{R}_1$ directly.

Notice that Theorem 12 [Zan94] cannot be applied to this example since $f$ has occurrences on the left-hand sides of the rules. The strengthened version of this theorem in [MOZ96] cannot handle this example, either.
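The split of the rules in Example 4 into those with equal erasures and those with strictly decreasing erasures can be recomputed with a short program. The Python sketch below is an illustration under explicit assumptions: the erasure function follows the reading inferred from the examples (projection takes the union of the selected arguments, symbols without a rule keep all arguments), sets replace multisets, and the proper-subterm relation stands in for the reduction ordering, which suffices here because every simplification ordering contains it. The names erase, proper_subterm and classify are hypothetical.

```python
from itertools import product

def erase(t, rules):
    """Set-based erasure |t| for projection ("p") and omission ("o") rules."""
    if not isinstance(t, tuple):
        return {t}
    f, args = t[0], t[1:]
    kind, positions = rules.get(f, ("o", ()))
    if kind == "p":
        result = set()
        for i in positions:
            result |= erase(args[i - 1], rules)
        return result
    kept = [erase(a, rules) for i, a in enumerate(args, start=1)
            if i not in positions]
    return {(f, *combo) for combo in product(*kept)}

def proper_subterm(t, s):
    """True iff t is a proper subterm of s."""
    return isinstance(s, tuple) and any(
        a == t or proper_subterm(t, a) for a in s[1:])

def classify(lhs, rhs, rules):
    """'equal' if |l| = |r|; 'strict' if |l| \\ |r| dominates |r| \\ |l|."""
    L, R = erase(lhs, rules), erase(rhs, rules)
    if L == R:
        return "equal"
    if all(any(proper_subterm(r, l) for l in L - R) for r in R - L):
        return "strict"
    return "unordered"

Z = ("0",)
def br(a, b): return ("branch", a, b)
def f(a): return ("f", a)
def g(a): return ("g", a)

S = {"branch": ("p", (1, 2))}
RULES = [
    (f(Z), Z),
    (f(br(Z, "x")), br(Z, f("x"))),
    (f(br(br("x", "y"), "z")), f(br("x", br("y", "z")))),
    (g(Z), Z),
    (g(br("x", Z)), br(Z, g("x"))),
    (g(br("x", br("y", "z"))), g(br(br("x", "y"), "z"))),
]

print([classify(l, r, S) for l, r in RULES])
```

Rules (3) and (6) come out as "equal" and the remaining four as "strict", matching the split that Theorem 5 is applied to above.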


Example 5. The following TRS $\sigma_0$ describes the process of substitution in combinatory logic, and the proof for the termination of $\sigma_0$ in [CHR92] is involved. Some simplified proofs have been given in [Zan94,Zan95].

(1) λ(x) ∘ y -> λ(x ∘ (1 · (y ∘ ↑)))      (5) 1 ∘ id -> 1
(2) (x · y) ∘ z -> (x ∘ z) · (y ∘ z)      (6) 1 ∘ (x · y) -> x
(3) (x ∘ y) ∘ z -> x ∘ (y ∘ z)            (7) ↑ ∘ (x · y) -> y
(4) id ∘ x -> x

The following is the $\mathcal{S}$-erasure of $\sigma_0$ for $\mathcal{S}$ = {(·-p-(1,2))}.

(1') λ(x) ∘ y -> {λ(x ∘ 1), λ(x ∘ (y ∘ ↑))}      (5') 1 ∘ id -> 1
(2') {x ∘ z, y ∘ z} -> {x ∘ z, y ∘ z}            (6') {1 ∘ x, 1 ∘ y} -> x
(3') (x ∘ y) ∘ z -> x ∘ (y ∘ z)                  (7') {↑ ∘ x, ↑ ∘ y} -> y
(4') id ∘ x -> x

As shown in [Zan94], all the rule erasures except the second one can be strictly ordered under a total ordering. By Theorem 5, the termination of $\sigma_0$ follows from the termination of the TRS consisting of the rule (x · y) ∘ z -> (x ∘ z) · (y ∘ z), which is obvious. Notice that the distribution elimination technique [Zan94] cannot be directly applied to this example because of the occurrences of · on the left-hand sides of some rules. If we replace the last rule in $\sigma_0$ with ↑ ∘ (x · y) -> y · x, then the strategy used in [Zan94] would no longer work but Theorem 5 could still be applied.


Example 6. The following example is adapted from the technical report version of [AG97], where it is formed as a variation of an algorithm in [?]. The purpose of the function rename(x, y, t) is to replace every free occurrence of the variable x in the term t with the variable y.


(1) true ∧ y -> y
(2) false ∧ y -> false
(3) [] = [] -> true
(4) (x :: xs) = [] -> false
(5) [] = (y :: ys) -> false
(6) (x :: xs) = (y :: ys) -> (x = y) ∧ (xs = ys)
(7) var(xs) = var(ys) -> xs = ys
(8) var(xs) = apply(s, t) -> false
(9) var(xs) = lambda(x, s) -> false
(10) apply(s, t) = var(ys) -> false
(11) apply(s, t) = apply(u, v) -> (s = u) ∧ (t = v)
(12) apply(s, t) = lambda(x, u) -> false
(13) lambda(x, s) = var(ys) -> false
(14) lambda(x, s) = apply(u, v) -> false
(15) lambda(x, s) = lambda(y, t) -> (x = y) ∧ (s = t)
(16) if(true, var(xs), var(ys)) -> var(xs)
(17) if(false, var(xs), var(ys)) -> var(ys)
(18) rename(var(xs), var(ys), var(zs)) -> if(xs = zs, var(ys), var(zs))
(19) rename(x, y, apply(s, t)) -> apply(rename(x, y, s), rename(x, y, t))
(20) rename(x, y, lambda(z, t)) -> lambda(•, rename(x, y, rename(z, •, t)))

Note that • in rule (20) stands for var([x, y, lambda(z, t)]). Let $\mathcal{R}_1$ consist of the first 17 rules and $\mathcal{R}_2$ consist of the rest of the rules. Then $\mathcal{R} = \mathcal{R}_1 \cup \mathcal{R}_2$ is a hierarchical combination of $\mathcal{R}_1$ and $\mathcal{R}_2$. Clearly, $\mathcal{R}_1$ can be proven terminating with some RPO. We form the following $\mathcal{S}$-erasure $|\mathcal{R}|$ of $\mathcal{R}$ for $\mathcal{S}$ = {(∧-p-2), (=-o-(1,2)), (var-o-1), (if-o-(1,2,3)), (rename-p-3), (lambda-o-1)}.


(1') y -> y
(2') y -> false
(3') = -> true
(4') = -> false
(5') = -> false
(6') = -> =
(7') = -> =
(8') = -> false
(9') = -> false
(10') = -> false
(11') = -> =
(12') = -> false
(13') = -> false
(14') = -> false
(15') = -> =
(16') if -> var
(17') if -> var
(18') var -> if
(19') apply(s, t) -> apply(s, t)
(20') lambda(t) -> lambda(t)

It can be readily verified that $|\mathcal{R}|$ is $\mathcal{R}_2$-conservative. Under the RPO with the precedence relation true ≈ false ≈ = ≈ var ≈ if, all the rules can be ordered. Note that rule (2') is ordered because we can require that false be a constant with the lowest precedence. Let $\succeq_1$ denote this RPO. We can then form an ordering $\succeq_2$ with the precedence rename $\succ$ apply, lambda as described in Definition 7. Then the right side of rule (18) is $\mathcal{R}_2$-skeleton normal, and both rules (19) and (20) can be ordered under $\succ_2$. Therefore, $\mathcal{R}$ is terminating by Theorem 3.